Abstract

A  channel  compensation  method  is  sought  for  use  in  speaker  identification  (ID)  and 
verification  applications  under  matched  and  mismatched  training  and  testing  conditions. 
This work expands on previous work on matched conditions by investigating three techniques
on matched and mismatched conditions using the TIMIT and NTIMIT speech
databases. First, previous results on 168 speakers are reproduced for matched conditions
using Gaussian mixture models (GMMs) and mel-frequency cepstral coefficients. Next, cepstral
mean subtraction with band limiting (CMSBL) is investigated. The third method,
developed in this thesis, uses a modified Wiener filtering approach to channel compensation.
New GMMs are created for each method. The first approach is then expanded
to  include  all  630  TIMIT  and  NTIMIT  speakers  for  speaker  verification.  For  speaker  ID 
under  matched  conditions,  the  CMSBL  method  had  three  more  errors  than  no  additional 
preprocessing  but  yielded  the  best  ID  results  for  the  mismatch  case  with  27.4%  correct. 
Additionally,  the  CMSBL  method  yielded  the  best  verification  results  with  an  equal  error 
rate  of  approximately  0.26%  for  matched  conditions  on  TIMIT  and  approximately  19.6% 
for  mismatched  conditions  on  NTIMIT. 
SPEAKER VERIFICATION IN THE PRESENCE OF
CHANNEL MISMATCH USING
GAUSSIAN MIXTURE MODELS

I.  Introduction 

1.1  Background 

What is speaker verification in the presence of channel mismatch? Speaker verification
is related to the process of speaker identification (ID), also known as speaker recognition
[6], where a machine attempts to determine which speaker, out of a group of registered
speakers, a recorded portion of speech came from. Verification, on the other hand, starts
with a person claiming a particular identity. The machine must then determine whether
or not the speaker is who they claim to be. Proper verification is critical for controlling
access to sensitive information or special areas. Just as people have difficulty identifying
others over the telephone, computers have difficulty correctly identifying or verifying people
when speech is recorded under conditions different from those the computer was trained on. This
difference between training and testing conditions is channel mismatch. While Gaussian mixture
models using Mel-frequency cepstral coefficients have had considerable success for similar
training and testing conditions [14], currently no method adequately handles the cases
where training and testing conditions are different. Consequently, a solution for speaker
verification in the presence of channel mismatch is needed.

1.2  Problem  Statement 

Develop a channel compensation method for speaker identification (ID) and speaker
verification that compensates for the channel mismatch between training and testing conditions.
1.3 Assumptions

In  order  to  facilitate  development,  this  thesis  assumes  that  all  training  collections 
are  cooperative.  For  this  thesis,  cooperative  means  that  a  speaker’s  voice  was  recorded 
with  their  knowledge  under  controlled  conditions,  and  that  the  speaker  made  no  conscious 
attempt  to  alter  their  voice  to  sound  like  one  of  the  other  speakers  or  someone  other  than 
themselves.  For  telephone  quality  speech,  the  utterance  passed  to  the  system  was  modified 
only  by  the  telephone  channel.  Due  to  the  nature  of  the  databases,  a  subject  is  assumed 
to  be  in  the  same  physical  and  emotional  states  during  training  and  testing. 

1.4  Scope  and  Research  Objectives 

The TIMIT and NTIMIT databases are used to compare previous spectral
processing methods with a new filtering approach. The ability of processing methods
to perform speaker ID and verification in channel mismatch conditions will be experimentally
evaluated. Toward that end, the research objectives are as follows:

•  Reproduce  previous  speaker  ID  results  using  Gaussian  mixture  models  (GMMs)  on 
TIMIT  and  NTIMIT 

• Reproduce previous speaker verification results using GMMs under matched conditions
using TIMIT and under channel mismatch conditions using NTIMIT

•  Extend  previous  speaker  verification  results  to  a  larger  population 

•  Experimentally  evaluate  the  effect  of  a  modified  Wiener  filter  preprocessor  for  speaker 
ID  and  verification  tasks  on  TIMIT  and  NTIMIT 

1.5  Organization 

The  remainder  of  the  thesis  is  divided  into  four  chapters.  Chapter  II  presents  the 
theory  behind  using  Mel-frequency  cepstral  coefficients  (MFCCs)  and  Gaussian  mixture 
models  (GMMs)  when  performing  speaker  identification  and  speaker  verification.  Chapter 
II  also  outlines  the  channel  compensation  techniques  of  cepstral  mean  subtraction  and 
the  modified  Wiener  filtering  technique.  Chapter  III  outlines  the  computer  equipment  and 
software and how they were used in calculating the results. Chapter IV presents the results
of  the  thesis  broken  down  by  identification  or  verification  and  then  further  subdivided  by 
corpus  and  method.  Chapter  V  highlights  the  conclusions  drawn  from  the  research  and 
includes  suggestions  for  future  study. 
II. Theory

2.1 Introduction

This  chapter  outlines  the  theory  behind  speaker  identification  and  verification  using 
Mel-frequency cepstral coefficients (MFCCs) and Gaussian mixture models (GMMs). First,
motivation for using Mel-frequency cepstral coefficients is provided, and then the basics
of how to generate them are outlined. Next, the theory behind Gaussian mixture models
and  their  training  is  explained.  The  following  sections  then  explain  the  theory  for  speaker 
identification  (ID)  and  the  steps  involved  for  speaker  verification.  The  chapter  also  outlines 
the  theory  behind  two  channel  compensation  techniques,  the  commonly  used  cepstral  mean 
subtraction  [6, 9, 15]  and  a  new  modified  Wiener  filter  approach. 

2.2  Mel-frequency  Cepstral  Coefficients  (MFCCs) 

2.2.1  Motivation  for  using  MFCCs.  MFCCs  provide  a  compact  representation  for 
modeling  an  individual’s  vocal  tract  by  separating  it  from  the  pitch  information  through 
homomorphic deconvolution [6]. This has the added benefit of lessening linear time-invariant
channel effects [6]. Using homomorphic deconvolution is based on the premise
that the vocal tract can be modeled as a linear time-invariant filter [10].

2.2.2  Creating  MFCCs.  The  feature  vectors  used  in  this  thesis  are  MFCCs 
computed  for  windowed  segments  of  each  utterance.  This  multi-step  process  is  begun  by 
fixing  window  and  step  sizes.  The  window  size  determines  the  duration  of  a  segment  of 
the  utterance  to  consider.  The  step  size  indicates  how  far  to  shift  the  window  along  the 
duration  of  an  utterance  from  the  beginning  of  the  previous  window.  The  calculation  of 
an  MFCC  vector  begins  by  taking  a  20  ms  window  of  an  utterance  [12]  and  determining 
whether  it  contains  voiced  speech  or  not.  If  the  window  contains  voiced  speech,  preemphasis 
is  performed  using  the  common  preemphasis  coefficient  of  0.97  [18].  Next,  the  magnitude 
of  the  discrete  Fourier  transform  (DFT)  is  calculated,  and  a  triangular  filterbank  is  placed 
across  the  spectrum.  The  filters  are  placed  such  that  the  beginning  of  the  next  filter  is 
at  the  center  frequency  of  the  previous  filter.  Figure  2.1  illustrates  how  the  filters  divide 
the frequency spectrum. The log of the magnitude of each triangular filter output is
then calculated and a discrete cosine transform performed, with the resulting coefficients
forming the MFCC vector for the given time window [5]. Next, the time window is shifted
along the utterance’s duration by the step size; a common value is 10 ms [12]. The entire
process is then repeated until a window contains the end of the utterance.

Figure 2.1 Spectrum with 23 Triangular Filters for MFCCs
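The windowing and preemphasis steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a trailing partial window is dropped rather than padded, the first sample of each window is left unchanged by preemphasis, and the voiced/unvoiced decision is omitted.

```python
def frame_signal(samples, sample_rate, window_ms=20.0, step_ms=10.0):
    """Split a sequence of samples into overlapping windows.

    Assumption: a final window that would run past the end of the
    utterance is dropped instead of being zero-padded.
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))
    step_len = int(round(sample_rate * step_ms / 1000.0))
    frames = []
    start = 0
    while start + window_len <= len(samples):
        frames.append(list(samples[start:start + window_len]))
        start += step_len
    return frames


def preemphasize(frame, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1] inside one window.

    Assumption: the first sample is passed through unchanged.
    """
    if not frame:
        return []
    rest = [frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))]
    return [frame[0]] + rest
```

At a 16 kHz sampling rate, a 20 ms window holds 320 samples and a 10 ms step advances the window by 160 samples, so consecutive windows overlap by half.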

While  the  concepts  for  generating  MFCCs  are  common  among  references,  there  is 
some  variation  in  the  actual  calculations  [5,11,18].  For  this  thesis,  the  method  of  [18] 
was  used.  Using  this  method,  MFCCs  are  calculated  as 


$$c_k = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi k}{N}\,(j - 0.5)\right) \quad \text{for } k = 1 \ldots K. \qquad (2.1)$$

In Equation 2.1, $N$ is the number of triangular filters, $K$ is the total number of coefficients
desired, and $m_j$ is the log filterbank amplitude for the $j$th filter from the filterbank along
the mel scale. The mel scale is based on experiments on human hearing that suggest filters
spaced approximately linearly below 1000 Hz and logarithmically above 1000 Hz [10]. The
mel scale may be defined as [18]

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \qquad (2.2)$$
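As a rough illustration of Equations 2.1 and 2.2, the sketch below converts a frequency to mels and turns a list of log filterbank amplitudes into cepstral coefficients. The function names are hypothetical, and the filterbank itself (triangle placement and weighting) is assumed to have been applied already.

```python
import math


def hz_to_mel(freq_hz):
    """Mel scale of Equation 2.2."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mfcc_from_log_energies(log_energies, num_ceps):
    """Cosine transform of Equation 2.1.

    log_energies holds m_1 ... m_N (one value per triangular filter);
    the result holds c_1 ... c_K with K = num_ceps.
    """
    n_filters = len(log_energies)
    scale = math.sqrt(2.0 / n_filters)
    ceps = []
    for k in range(1, num_ceps + 1):
        total = 0.0
        for j, m_j in enumerate(log_energies, start=1):
            total += m_j * math.cos(math.pi * k / n_filters * (j - 0.5))
        ceps.append(scale * total)
    return ceps
```

Two quick sanity checks follow from the formulas: 1000 Hz maps to roughly 1000 mel, and a perfectly flat set of log filterbank amplitudes yields coefficients that are all approximately zero, because the index $k$ starts at 1 and so excludes the constant term.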

2.3  Gaussian  Mixture  Models  (GMMs) 

Once  the  MFCCs  have  been  calculated,  they  are  used  as  feature  vectors  for  classifi¬ 
cation  in  determining  speaker  identification  and  speaker  verification. 

2.3.1  Theory  of  GMMs.  A  GMM  is  a  parametric  model  consisting  of  a  weighted 
sum of component Gaussian densities. The model, $\lambda_s$, for a given speaker $s$ is defined as
a function of the parameters $p_i$, $\vec{\mu}_i$, and $\Sigma_i$ such that

$$\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\} \quad \text{for } i = 1 \ldots M. \qquad (2.3)$$

The density of a $P$-dimensional feature vector, $\vec{x}$, from a sampled window from a given
speaker $s$ can be described by

$$p(\vec{x} \mid \lambda_s) = \sum_{i=1}^{M} p_i\, b_i(\vec{x}), \qquad (2.4)$$

where $M$ is the number of mixtures and $p_i$ is the probability or weight of component $i$
such that $\sum_{i=1}^{M} p_i = 1$. The second term, $b_i(\cdot)$, is the density of the Gaussian component
$i$ and is found by

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(\vec{x} - \vec{\mu}_i)^{T} \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i)\right\}, \qquad (2.5)$$

where $\Sigma_i$ is the covariance matrix (assumed to be diagonal [14]) and $\vec{\mu}_i$ is the mean vector
for component $i$. If $N$ is the number of samples in an utterance, the likelihood that an utterance
$U$ came from a given GMM can be found from [4]

$$\mathcal{L}(U \mid \lambda_s) = \prod_{n=1}^{N} p(\vec{x}_n \mid \lambda_s), \qquad (2.6)$$

which leads to the log-likelihood

$$\log \mathcal{L}(U \mid \lambda_s) = \sum_{n=1}^{N} \log p(\vec{x}_n \mid \lambda_s). \qquad (2.7)$$

Using Equation 2.4 and Equation 2.7, and normalizing by $N$, the log-likelihood of an utterance
given a specific GMM is given as

$$\log \mathcal{L}(U \mid \lambda_s) = \frac{1}{N} \sum_{n=1}^{N} \log \sum_{i=1}^{M} p_i\, b_i(\vec{x}_n). \qquad (2.8)$$

The first term on the right hand side of Equation 2.8, $1/N$, is used to normalize the likelihood
scores so that the scores are independent of the number of voiced segments in a given
utterance.
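The normalized log-likelihood of Equation 2.8 can be sketched for diagonal covariances as shown below. This is an illustration, not the HTK computation used in the thesis: the function names are hypothetical, each covariance matrix is represented by a list of its diagonal variances, and the inner sum is evaluated with the log-sum-exp trick (an implementation choice made here to avoid numerical underflow).

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of Equation 2.5 for a diagonal covariance (var = diagonal entries)."""
    dim = len(x)
    log_det = sum(math.log(v) for v in var)
    mahal = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (dim * math.log(2.0 * math.pi) + log_det + mahal)


def gmm_avg_log_likelihood(features, weights, means, variances):
    """Equation 2.8: per-vector average of log sum_i p_i b_i(x_n)."""
    total = 0.0
    for x in features:
        terms = [math.log(w) + log_gaussian_diag(x, mu, var)
                 for w, mu, var in zip(weights, means, variances)]
        peak = max(terms)
        # log-sum-exp: factor out the largest term before exponentiating
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(features)
```

Because the result is divided by the number of feature vectors, a short utterance and a long utterance from the same speaker produce scores on a comparable scale.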


2.3.2 Training GMMs. In order to train the models, a prototype model is created
that has the desired number of means and variances for a single component. Initially,
the vectors of means are set to zero and the variance vectors are set to one. Next, a
speaker is selected, and the speaker’s utterances are converted into the previously defined
MFCC feature vectors. These feature vectors are then used to estimate the means and
variances by expectation maximization (EM, or Baum-Welch reestimation) [4]. The
model is “grown” to the desired number of component densities using a binary splitting
algorithm. This algorithm takes the component with the greatest weight $p_i$ and creates
two new components, each with half the original weight. The new means are obtained by
adding one standard deviation to, or subtracting it from, the original mean [18]. Additional
iterations of the EM algorithm are then performed. The feature vectors for a given speaker’s
training utterances are repeatedly presented to the speaker’s GMM until the desired number
of components, 32 [11], is reached.
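One splitting step of the kind described above might look like the following sketch. The function name and the `offset_scale` parameter are hypothetical; the default of one standard deviation follows the description in the text, and the EM reestimation that would follow each split is not shown.

```python
import math


def split_heaviest(weights, means, variances, offset_scale=1.0):
    """Replace the heaviest component with two perturbed copies.

    Each copy gets half the original weight and the original variances;
    the means move up and down by offset_scale standard deviations.
    """
    idx = max(range(len(weights)), key=lambda i: weights[i])
    std = [math.sqrt(v) for v in variances[idx]]
    up = [m + offset_scale * s for m, s in zip(means[idx], std)]
    down = [m - offset_scale * s for m, s in zip(means[idx], std)]
    half = weights[idx] / 2.0
    new_weights = weights[:idx] + [half, half] + weights[idx + 1:]
    new_means = means[:idx] + [up, down] + means[idx + 1:]
    new_vars = (variances[:idx]
                + [list(variances[idx]), list(variances[idx])]
                + variances[idx + 1:])
    return new_weights, new_means, new_vars
```

Applying the step repeatedly, with reestimation in between, grows a single-component prototype toward the target of 32 components while keeping the weights summing to one.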


2.4 Speaker Identification


After  a  separate  GMM  has  been  trained  for  each  speaker  in  the  system,  attempts 
at  speaker  identification  (ID)  may  begin.  For  speaker  ID,  a  speaker  is  prompted  to  utter 
some  phrase  which  is  then  recorded.  After  determining  voiced  and  unvoiced  portions 
of  the  utterance,  MFCCs  are  generated  from  the  voiced  portions  of  the  utterance.  The 
resulting  MFCC  feature  vectors  are  then  presented  to  the  GMMs  for  each  speaker  currently 
registered  in  the  system.  The  model  with  the  highest  score  for  a  given  utterance  is  believed 
to  belong  to  the  person  who  gave  the  utterance.  Mathematically  this  can  be  represented 
as  selecting  speaker  r  from  the  set  S  of  all  possible  speakers  for  a  given  utterance  U  such 
that 

$$r = \arg\max_{s \in S} \left\{ \log\left[\mathcal{L}(U \mid \lambda_s)\right] \right\}. \qquad (2.9)$$


2.5 Speaker Verification

2.5.1  Introduction.  While  speaker  ID  is  generally  a  closed  set  problem  (only 
speakers  registered  in  the  system  are  allowed  to  test),  applications  for  speaker  verification 
may  be  open  set  problems  (speakers  not  registered  in  the  system  may  test).  For  open  set 
problems,  it  is  not  enough  to  know  which  registered  speaker’s  model  scored  the  highest. 
The  system  must  determine  whether  the  speaker  is  really  who  he  or  she  claims  to  be  (the 
claimed  speaker  is  referred  to  as  the  claimant)  by  comparing  the  verification  score  to  some 
absolute  threshold. 

2.5.2  Cohort  Selection.  In  order  to  determine  whether  the  speaker  is  the  claimant 
or  not,  an  additional  preprocessing  step  must  be  made  after  all  of  the  GMMs  are  created. 
This  additional  step  is  the  selection  of  speaker  cohorts.  The  use  of  cohorts  is  necessary  to 
normalize  the  log  probabilities  to  minimize  the  effects  of  stress  and  the  natural  variability 
in any given speaker’s utterances [7]. Cohorts are chosen from among all of the registered
speakers as those that appear most like the claimant, called close cohorts, and those that
appear least like the claimant, called far cohorts. To select the cohorts, one utterance from
the speaker and one utterance from each of the other registered speakers are required.
Cohorts may be selected from speakers that are most similar or least similar to a
given speaker based on a symmetric distortion score [14]. This score is determined by

$$d_{sym}(\lambda_i, \lambda_j) = \log\frac{\mathcal{L}(U_i \mid \lambda_i)}{\mathcal{L}(U_i \mid \lambda_j)} + \log\frac{\mathcal{L}(U_j \mid \lambda_j)}{\mathcal{L}(U_j \mid \lambda_i)}, \qquad (2.10)$$

where $U_x$ is an utterance from speaker $x$ and $\lambda_x$ is the GMM for speaker $x$. A distortion
score is determined for a given speaker and all the remaining registered speakers. The
resulting scores are then sorted in ascending order. An equal number of maximally spread
close cohorts and maximally spread far cohorts are then chosen as reference speakers for
each speaker according to [11]. Detailed steps for determining cohorts can be found in
Appendix A.
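The distortion score and the ascending sort can be sketched as below. The names are hypothetical, the input is assumed to be a nested dictionary of precomputed log-likelihoods, and the final "maximally spread" selection of close and far cohorts (detailed in Appendix A) is not reproduced here.

```python
def symmetric_distortion(ll_ii, ll_ij, ll_jj, ll_ji):
    """Symmetric distortion between speakers i and j.

    ll_ab stands for log L(U_a | lambda_b), so each log-ratio in the
    score becomes a difference of log-likelihoods.
    """
    return (ll_ii - ll_ij) + (ll_jj - ll_ji)


def rank_by_distortion(speaker, log_likes):
    """Return (other_speaker, score) pairs sorted in ascending order.

    log_likes[a][b] is assumed to hold log L(U_a | lambda_b).
    """
    scored = []
    for other in log_likes:
        if other == speaker:
            continue
        score = symmetric_distortion(
            log_likes[speaker][speaker], log_likes[speaker][other],
            log_likes[other][other], log_likes[other][speaker])
        scored.append((score, other))
    scored.sort()
    return [(name, score) for score, name in scored]
```

Speakers at the front of the returned list are candidates for close cohorts, and speakers at the end are candidates for far cohorts.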


2.5.3 Verification. Verifying that a given speaker is the claimant is a multi-step process.
First, the speaker is prompted to say some phrase and the utterance, $U$, is recorded.
Next, any desired filtering, such as voiced/unvoiced determination and preemphasis, is
performed on the utterance. Third, the utterance is converted to MFCCs. The resulting
MFCC feature vectors are then presented to the claimant’s GMM and the likelihood that
the utterance was made by the claimant is calculated. The MFCCs are then presented to
the GMMs for each of the claimant’s cohorts. A verification score is obtained according to

$$v(U) = \log\left[\mathcal{L}(U \mid \lambda_c)\right] - \frac{1}{B} \sum_{s \in \Omega} \log\left[\mathcal{L}(U \mid \lambda_s)\right], \qquad (2.11)$$

where $U$ is the utterance, $\Omega$ is the claimant’s cohort set, $B$ is the number of cohorts, $\lambda_x$
is the appropriate GMM for speaker $x$, and $x \in \{\{c\} \cup \{s \mid s \in \Omega\}\}$. If $v(U)$ is greater than
or equal to some predetermined threshold, the speaker’s utterance is accepted as having
come from the claimant. Otherwise, the speaker is rejected as the claimant.
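The two decision rules, identification by highest score and verification by a cohort-normalized score compared against a threshold, can be sketched as follows. The function names are hypothetical, and the log-likelihoods are assumed to have been computed already by each speaker's GMM.

```python
def identify_speaker(scores):
    """Closed-set identification: pick the speaker whose model scores highest.

    scores maps a speaker label to log L(U | lambda_s).
    """
    return max(scores, key=scores.get)


def verification_score(claimant_ll, cohort_lls):
    """Claimant log-likelihood minus the mean cohort log-likelihood."""
    return claimant_ll - sum(cohort_lls) / len(cohort_lls)


def verify(claimant_ll, cohort_lls, threshold):
    """Accept the claim when the normalized score reaches the threshold."""
    return verification_score(claimant_ll, cohort_lls) >= threshold
```

Subtracting the cohort average means the threshold is applied to a relative score, so an utterance that scores poorly against every model (for example, because of channel effects) is not rejected on its absolute score alone.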


2.6  Channel  Compensation  Techniques 

2.6.1  Cepstral  Mean  Subtraction.  Cepstral  mean  subtraction  (CMS)  has  been 
found  to  be  helpful  in  reducing  the  effects  caused  by  different  channel  conditions  such  as 
different  types  of  microphones  or  actual  phone  channels  [16].  Performing  CMS  on  a  set 
of MFCC feature vectors requires determining the mean of the set of feature vectors and
subtracting  the  mean  from  the  individual  feature  vectors  in  the  set  prior  to  determining  a 
likelihood  score.  The  result  is  normalized  feature  vectors  that  lessen  the  effect  of  a  given 
channel. 
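A minimal sketch of cepstral mean subtraction follows, with a hypothetical function name and feature vectors represented as lists of floats. The reasoning behind it: a fixed linear time-invariant channel multiplies the spectrum, which becomes an additive constant in the log-spectral and cepstral domains, so removing the per-utterance mean removes that constant.

```python
def cepstral_mean_subtraction(vectors):
    """Subtract the per-dimension mean of an utterance's feature vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]
```

For example, adding the same offset vector to every feature vector of an utterance leaves the output of this function unchanged, which is the sense in which the normalized features are less sensitive to a fixed channel.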


2.6.2 Modified Wiener Filtering. A non-causal Wiener filter, assuming a zero
mean signal and noise, can be represented as [8]

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_n(\omega)}, \qquad (2.12)$$

where the power spectral densities of the desired signal, $P_s(\omega)$, and the noise signal, $P_n(\omega)$,
are known a priori. For this thesis, however, $P_s(\omega)$ and $P_n(\omega)$ are not known a priori and
may vary throughout the duration of an utterance. An initial approximation of the expected
$P_s(\omega)$ is made so that the modified Wiener filter may be modeled as

$$H_{mw}(\omega) = \frac{P_c(\omega)}{P_s(\omega) + P_n(\omega)}, \qquad (2.13)$$


where $P_c(\omega)$ is determined by averaging the DFT from window size segments of each of
the claimant’s training utterances. Since it is impossible to separately determine $P_s(\omega)$
and $P_n(\omega)$ in a real-world scenario, the sum of these terms may be approximated by
the measured power spectral density $P_u(\omega)$ so that the modified Wiener filter may be
approximated as

$$H_{mw}(\omega) \approx \frac{P_c(\omega)}{P_u(\omega)}, \qquad (2.14)$$

where there are no cross terms if the desired signal and the noise are statistically independent.
By manipulating the terms, Equation (2.14) becomes
(2.15)


and  the  filter  becomes  a  function  of  a  claimant’s  average  spectrum  and  the  spectrum  of 
the  utterance  being  considered. 
The impulse response, $h_{mw}(n)$, is calculated from the inverse DFT of $H_{mw}(\omega)$ and
convolved with the original utterance to obtain a new utterance U. The new utterance
is  then  broken  into  voiced  and  unvoiced  segments,  and  the  MFCCs  are  calculated  for  the 
voiced  segments.  These  feature  vectors  are  then  presented  to  the  appropriate  GMMs  for 
identification  or  verification. 
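The filtering chain described in this section (claimant average spectrum divided by the measured spectrum, inverse DFT, convolution) can be sketched with a naive DFT as shown below. Several details are assumptions made only for illustration: the spectra are taken as averaged squared DFT magnitudes, the measured spectrum of the test utterance is computed the same way as the claimant's, a small floor guards against division by zero, and all function names are hypothetical. The naive DFT is quadratic in the window length and is used here only to keep the sketch self-contained.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def avg_power_spectrum(frames):
    """Average of |DFT|^2 over equally sized windows."""
    acc = [0.0] * len(frames[0])
    for frame in frames:
        for k, coeff in enumerate(dft(frame)):
            acc[k] += abs(coeff) ** 2
    return [a / len(frames) for a in acc]


def modified_wiener_response(p_claimant, p_utterance, floor=1e-12):
    """Ratio of claimant spectrum to measured spectrum, bin by bin."""
    return [pc / max(pu, floor) for pc, pu in zip(p_claimant, p_utterance)]


def impulse_response(h_freq):
    """Real part of the inverse DFT of the frequency response."""
    n = len(h_freq)
    return [(sum(h_freq[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def convolve(signal, kernel):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out
```

When the measured spectrum already equals the claimant's average spectrum, the response is one in every bin, the impulse response reduces to a unit impulse, and the utterance passes through unchanged. Any other utterance is reshaped toward the claimant's long-term spectrum.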

2.7 Summary

This chapter presented the theoretical background for the speaker ID and verification
problems. First, the generation of MFCC feature vectors was described. Next, a
mathematical  basis  for  GMMs  was  given.  A  brief  outline  was  then  given  on  how  MFCCs 
and  GMMs  are  used  to  classify  utterances  for  determining  or  verifying  a  speaker’s  identity. 
Finally,  two  techniques  for  channel  compensation,  CMSBL  and  modified  Wiener  filtering, 
were  presented. 
III. Approach and Methods


3.1  Introduction 

In  this  chapter,  brief  explanations  of  the  equipment,  software,  and  speaker  databases 
used  are  provided.  Additionally,  the  procedures  used  for  determining  a  baseline  for  speaker 
identification  and  speaker  verification  are  given.  The  steps  for  performing  the  modified 
Wiener  filter  approach  are  laid  out  along  with  the  steps  taken  in  MFCC  generation  and 
the  training  and  testing  of  the  GMMs  in  order  to  perform  identification  and  verification. 

3.2  Computer  Equipment  and  Tools 

Before  beginning  any  undertaking,  it  is  important  to  have  the  correct  tools.  The 
experiments conducted for this thesis were performed on Ultra 1 computers by Sun Microsystems.
One of the machines had a CD-ROM drive capable of reading the TIMIT and
NTIMIT  databases  which  were  transferred  to  an  external  4.0  GB  disk  drive.  All  of  the 
Ultras  ran  Sun  Microsystems’  Solaris  2.5  operating  system. 

Working  with  the  data  from  either  corpus  was  done  through  one  of  three  tools.  The 
operating  environment  was  set  up  for  UNIX  C  shells.  Shell  programs  were  used  to  create 
directory  structures,  lists  of  speakers  and  utterances,  and  to  automate  calls  to  other  tools. 
Mathworks’ MATLAB version 4.2c as well as Entropic’s HTK (Hidden Markov Model Toolkit)
version 2.0 and ESPS version 5.1 were used to manipulate utterances, generate MFCCs,
train  Gaussian  mixture  models  (GMMs),  and  obtain  log-likelihood  scores. 

3.3  Speaker  Databases 

Developed by Texas Instruments (TI), Inc. and the Massachusetts Institute of Technology
(MIT), the TIMIT database was initially designed for speech recognition. It was
created  under  nearly  ideal  conditions  using  the  same  recording  equipment  over  a  short 
period  of  time  with  630  speakers.  The  recordings  were  made  with  an  8  kHz  bandwidth 
at  a  sampling  rate  of  16  kHz  [1].  This  means  that  there  is  a  low  degree  of  inter-session 
variability  in  a  given  speaker’s  utterances,  acoustical  noise,  and  equipment  variability. 
The NTIMIT database was created by NYNEX using the original TIMIT database
[2]. The original utterances were played back through an artificial mouth into a carbon-button
telephone handset. The resulting speech was then transmitted to central offices of
local  or  long-distance  systems  and  looped  back  for  recording  at  a  16  kHz  sampling  rate. 

The  databases  are  originally  broken  into  test  and  training  sections  with  no  common 
speakers  between  the  sections.  These  sections  are  further  subdivided  into  eight  dialect 
regions  based  on  a  speaker’s  primary  residence  during  their  lifetime.  The  dialect  regions  are 
then  divided  into  speakers  with  each  speaker’s  10  utterances  in  the  speaker’s  subdirectory. 

3.4  Baseline 


Initial experiments were done to reproduce Reynolds’ previous methods [12] and
results [13,14] for identification and verification using the TIMIT database. The process for
reproducing the verification results is illustrated by Figure 3.1. For the baseline, filtering
was dependent on the method under consideration. Additional details are provided in the
following sections.

Figure 3.1 Baseline Speaker Verification Process

3.4.1 Training and Testing. Each speaker has 10 utterances divided into SA1
and SA2 (both of which are common to all speakers), as well as three SI sentences and
five SX sentences. For the reproduction of results, training for all speakers was done using
both SAs, all SIs, and the first three (in UNIX ls order) SX utterances from the TIMIT
database. All testing was done on the last two SX utterances. Initially, training and testing
were only done on the 168 speakers from across all eight dialect regions in the test section
for all three methods under consideration. In an additional round of experiments, all 630
speakers were used only for the straight TIMIT training and testing cases as well as the
straight NTIMIT testing cases for speaker ID and verification.
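The per-speaker split described above can be sketched as follows. The function name and the file names in the usage note are hypothetical, and plain string sorting is assumed to stand in for UNIX ls order (which can depend on the locale).

```python
def split_utterances(filenames):
    """Split one speaker's 10 utterances into training and test lists.

    Training: both SA sentences, all SI sentences, and the first three
    SX sentences in sorted order. Testing: the remaining two SX sentences.
    """
    def of_kind(prefix):
        return sorted(f for f in filenames if f.lower().startswith(prefix))

    sx = of_kind("sx")
    train = of_kind("sa") + of_kind("si") + sx[:3]
    test = sx[3:]
    return train, test
```

For a speaker directory holding two SA, three SI, and five SX files, the function returns eight training files and two test files. Note that string sorting places a name such as `sx108.wav` before `sx18.wav`, so which SX sentences land in the test set depends on the ordering rule, not on the sentence numbers.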

3.4.2 Straight TIMIT & NTIMIT. When utterances were taken straight from the
original database recordings with no warping or filtering, they are referred to as “straight”
utterances. Due to the formatting of the original databases, these utterances were manipulated
by removing the header information using the ESPS bhd command and then
using the UNIX utility dd to swap an utterance’s byte order. The m-file detvoice.m (see
Appendix  B)  was  used  to  determine  the  voiced  and  unvoiced  segments  based  on  the  hand- 
labeled  phonetic  transcriptions  provided  with  each  utterance  in  the  TIMIT  and  NTIMIT 
databases.  The  transcriptions  were  used  to  facilitate  reproduction  of  results.  MFCCs  were 
created  using  HTK’s  HCopy  and  the  HTK  configuration  file  hconfig.  The  hconfig  file  was 
set  to  use  24  filters  in  the  filterbank  and  return  23  coefficients.  GMMs  were  created  from 
each  speaker’s  eight  training  utterances  while  likelihood  scores  were  generated  from  the 
two  test  utterances  from  each  speaker. 
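The header removal and byte swap were done with the ESPS bhd command and the dd utility. Purely as an illustration of what those two steps accomplish, the sketch below performs an equivalent operation on an in-memory byte string. The function name is hypothetical, the samples are assumed to be 16-bit integers, and the default header length of 1024 bytes is an assumption that should be checked against the actual files.

```python
import array


def strip_header_and_swap(raw_bytes, header_bytes=1024):
    """Drop a fixed-size header and swap the byte order of 16-bit samples.

    Assumptions: header_bytes matches the real header length and the
    remaining payload is a whole number of 2-byte samples.
    """
    samples = array.array("h")
    if samples.itemsize != 2:
        raise RuntimeError("expected 2-byte samples for type code 'h'")
    samples.frombytes(raw_bytes[header_bytes:])
    samples.byteswap()
    return samples
```

Swapping is needed only when the byte order of the stored samples differs from that of the machine reading them; applying it to files that are already in native order would corrupt the waveform.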

3.4.3 CMSBL. To reproduce the cepstral mean subtraction with bandlimiting
(CMSBL) method, a separate HTK hconfig file was made. This time, however, the HTK
hconfig file was also set to bandlimit the utterances to 400-3200 Hz in accordance
with  [12]  before  calculating  the  MFCCs.  The  voiced  and  unvoiced  detection  was  again 
performed  using  detvoice.m  to  access  the  hand-labeled  phonetic  transcription  files.  New 
GMMs  were  created  for  each  speaker  from  these  MFCCs  and  new  log-likelihood  scores 
were  computed  for  each  speaker’s  test  utterances. 

3.5  Modified  Wiener  Filtering 

Figure 3.2 Modified Wiener Filter Speaker Verification Process

The process of manipulating utterances using the modified Wiener filtering approach
was similar to the straight and CMSBL methods. The difference in the preprocessing is
illustrated in Figure 3.2. As Figure 3.2 illustrates, modified Wiener filtering was applied
to  each  of  the  utterances  prior  to  generating  the  MFCCs.  Each  utterance  was  filtered 
using  the  claimant’s  long-term  average  spectrum.  This  was  done  by  approximating  the 
frequency  response  of  the  channel  using 


$$H_{mw}(\omega) = \frac{P_c(\omega)}{P_s(\omega) + P_n(\omega)} \approx \frac{P_c(\omega)}{P_u(\omega)}, \qquad (3.1)$$

where $P_c(\omega)$ was determined by averaging the DFT from window size segments of each of
a claimant’s training utterances, $P_s(\omega)$ and $P_n(\omega)$ were the power spectral densities of the
desired signal and the noise signal, respectively, and $P_u(\omega)$ was the measured power spectral density.

The impulse response of the filter, $h_{mw}(n)$, was calculated and convolved with the
original utterance to obtain a new utterance U. The new utterance was then broken into
voiced  and  unvoiced  segments  using  the  appropriate  hand-labeled  phonetic  transcription 
files  and  the  MFCCs  calculated  for  the  voiced  segments.  New  GMMs  were  generated  for 
this  method  also.  The  process  was  then  repeated  for  each  speaker’s  two  test  utterances 
using  all  possible  speakers  as  the  claimant. 

The  inspiration  for  this  approach  came  from  work  done  with  Wiener  filtering  of 
distorted  images  [8].  By  using  this  modified  approach  on  speech  corrupted  by  a  channel, 
the  signal  resulting  from  the  warping  should  skew  the  spectrum  of  an  utterance  to  be 
more like those of the claimant and the claimant’s cohorts as seen earlier in this chapter.
Additionally,  this  approach  will  smear  the  spectra,  just  as  its  image  counterpart,  which 
may  result  in  greater  speaker  ID  errors  but  hopefully  fewer  verification  errors  similar  to 
the  CMSBL  method. 

3.6 MFCC Calculation

Once  header  information  had  been  removed,  the  byte  order  of  the  utterances  swapped, 
and  the  voiced  and  unvoiced  segments  determined,  the  MFCCs  were  calculated  using 
HTK’s  HCopy  command.  This  command  was  invoked  either  manually  or  by  the  m-file 
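modwmfcc.m (see Appendix B), to compute the MFCCs for voiced segments according to

$$c_k = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi k}{N}\,(j - 0.5)\right) \quad \text{for } k = 1 \ldots K, \qquad (3.2)$$

where $K$ was the number of desired coefficients, $N$ was the number of filters to use in the
filterbank, and $m_j$ was the log amplitude of the $j$th filter of the triangular filters along the
mel-scale defined by

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \qquad (3.3)$$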

The  hconfig  file  used  with  HCopy  varied  depending  on  whether  straight  or  CMSBL  MFCCs 
were  desired. 

3.7  GMM  Generation 

3.7.1 HMM or GMM? Once the MFCCs had been calculated, models were developed
for each speaker under each of the three methods. For expediency and the ability
to  facilitate  a  reproduction  of  the  experiments,  HTK  was  used  to  develop  a  degenerate 
form  of  hidden  Markov  model  (HMM).  HTK  creates  HMMs  that  always  have  initial  and 
ending  non-emitting  states.  By  including  only  one  state  between  them  and  “growing”  that 
state  into  multiple  Gaussian  mixtures,  a  Gaussian  mixture  model  (GMM)  may  be  created. 
From  this  point  on,  the  HTK  HMMs  will  simply  be  referred  to  as  GMMs. 
3.7.2 Prototype GMM Development. A prototype model was first created according
to the specifications in The HTK Book [18]. A sample prototype for straight data
is included below.

1. <BeginHMM>
2. <NumStates> 3 <VecSize> 23 <MFCC> <nullD> <diagC>
3. <StreamInfo> 1 23
4. <State> 2 <NumMixes> 1
5. <Stream> 1
6. <Mixture> 1 1.0
7. <Mean> 23
8. 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9. <Variance> 23
10. 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
11. <TransP> 3
12. 0.000e+0 1.000e+0 0.000e+0
13. 0.000e+0 6.000e-1 4.000e-1
14. 0.000e+0 0.000e+0 0.000e+0
15. <EndHMM>

All  prototypes  and  model  definitions  begin  with  the  line  “<BeginHMM>”.  The  second 
line  of  the  prototype  indicates  the  total  number  of  states  that  the  model  should  have,  the 
size  of  the  vectors  it  should  expect,  the  type  of  data,  any  special  handling,  and  the  type 
of  covariance  matrix  to  assume.  In  this  thesis,  only  three  states  were  required  (the  initial 
non-emitting  state,  the  state  of  interest,  and  the  final  non-emitting  state).  For  consistency 
with  Reynolds,  the  vector  size  was  chosen  to  calculate  23  MFCCs  using  a  bank  of  24 
filters  [11].  The  third  line  indicates  the  number  of  sources  that  will  be  presented  to  the 
model  and  their  size.  Line  4  determines  the  state  number  and  its  number  of  mixtures. 
Line  6  indicates  the  mixture  number  for  a  given  state  and  the  probability  of  that  mixture. 

For  each  mixture  in  a  state,  other  than  the  initial  and  final  states,  the  state  number 
and  the  desired  number  of  means  and  variances  for  the  state  are  also  indicated,  e.g.,  lines 
7  and  9.  Below  the  number  of  means  and  variances,  an  appropriate  number  of  constants 
are  entered  as  place  holders  in  lines  8  and  10.  The  actual  values  are  not  important,  only 
that  there  be  as  many  as  indicated  by  VecSize.  To  speed  implementation,  the  means  are 
typically zero and the variances one. Line 11 indicates the beginning of the transition
probability matrix for the HMM and indicates the number of states. In practice, the
transition  probabilities  should  be  close  to  the  final  results,  but  do  not  need  to  be  exact. 
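
A prototype of this shape can also be generated programmatically. The following Python sketch writes the same 15 lines for an arbitrary vector size. The function name make_prototype and its parameters are hypothetical, the placeholder means and variances follow the zero/one convention described above, and the self-loop probability of 0.6 simply mirrors the sample prototype.

```python
def make_prototype(vec_size: int = 23, self_loop: float = 0.6) -> str:
    """Return the text of a 3-state, single-mixture HTK prototype definition."""
    if vec_size < 1:
        raise ValueError("vec_size must be positive")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    zeros = " ".join(["0.0"] * vec_size)  # placeholder means
    ones = " ".join(["1.0"] * vec_size)   # placeholder variances
    lines = [
        "<BeginHMM>",
        f"<NumStates> 3 <VecSize> {vec_size} <MFCC> <nullD> <diagC>",
        f"<StreamInfo> 1 {vec_size}",
        "<State> 2 <NumMixes> 1",
        "<Stream> 1",
        "<Mixture> 1 1.0",
        f"<Mean> {vec_size}",
        zeros,
        f"<Variance> {vec_size}",
        ones,
        "<TransP> 3",
        "0.000e+0 1.000e+0 0.000e+0",
        f"0.000e+0 {self_loop:.3e} {1.0 - self_loop:.3e}",
        "0.000e+0 0.000e+0 0.000e+0",
        "<EndHMM>",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives a starting model for the initialization step. Only the number of placeholder values has to agree with VecSize, since the values themselves are overwritten during training.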

It should be noted that the model had to be split from one to two and then from two to four
components before reestimating; otherwise the results were suboptimal. When the two-to-four
split was not made before reestimating, the model returned to a single component after
reestimation following the initial split rather than adjusting to two components.

3.7.3 Training the GMM. Once the MFCCs were generated for all the utterances
and a prototype GMM developed, an individual speaker’s GMM was “grown” to the desired
size of 32 components [11, 12]. This was done by selecting a given speaker and the
associated training MFCCs and presenting them to the GMM. This process was done
using the C-shell script gmm2maker (found in Appendix B) and is illustrated with the
following pseudocode.

1.  Initialize  the  model  using  the  speaker’s  training  utterances  and  HInit 

2.  Reestimate  the  model  using  all  training  utterances  with  HRest 

3. Perform a binary split of the mixture using HHEd with MU 2 (MU q is an HTK
command that increases the number of components in the mixture to q)

4. Perform a binary split of both mixtures using HHEd with MU 4

5. Reestimate the model with HRest using all training utterances

6. Perform a binary split of the top two mixtures using HHEd with MU (2 × x),
where x = 3, ..., M/2 and M is the desired number of mixtures

7.  Reestimate  the  model  using  all  training  utterances  with  HRest 

8.  Repeat  6  and  7  until  the  desired  number  of  mixtures  is  reached 
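
The order of operations in steps 1 through 8 can be made explicit with a short Python sketch that only lists the actions, without calling HTK. The function name growth_schedule and the action labels are hypothetical. The sketch assumes an even target size of at least four components, as in the 32-component models used here.

```python
def growth_schedule(target_mixes: int = 32) -> list:
    """List the (action, argument) pairs needed to grow a GMM to target_mixes."""
    if target_mixes < 4 or target_mixes % 2 != 0:
        raise ValueError("target_mixes must be an even number of at least 4")
    steps = [
        ("initialize", ""),   # step 1
        ("reestimate", ""),   # step 2
        ("split", "MU 2"),    # step 3
        ("split", "MU 4"),    # step 4, no reestimation between the two splits
        ("reestimate", ""),   # step 5
    ]
    # Steps 6 to 8: add two components at a time, reestimating after each split.
    for x in range(3, target_mixes // 2 + 1):
        steps.append(("split", f"MU {2 * x}"))
        steps.append(("reestimate", ""))
    return steps
```

For a 32-component model the schedule contains 16 splits in total, ending with MU 32 followed by a final reestimation.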

3.7.4 Probability Scores. With GMMs created for all of the speakers, log-likelihood
scores for a given speaker’s utterance were obtained using the C-shell script
uttscores2.c (in Appendix B). The scores were calculated based on Equation 3.4 and
Equations 2.4 and 2.5 as

\log \mathcal{L}(U \mid \lambda_i) = \frac{1}{N} \sum_{n=1}^{N} \log\left[ p(x_n \mid \lambda_i) \right], \qquad (3.4)

where  N  is  the  number  of  voiced  samples  in  an  utterance. 
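
To make the score concrete, the following Python sketch computes the average per-frame log-likelihood of an utterance under a diagonal-covariance GMM. It is not the HVite computation itself. The function names and the list-based data layout are assumptions made for illustration, and the mixture density is taken to be the usual weighted sum of Gaussians.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm_density(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(frames, weights, means, variances):
    """Average of the per-frame log densities over the N frames of an utterance."""
    if not frames:
        raise ValueError("at least one frame is required")
    total = sum(log_gmm_density(x, weights, means, variances) for x in frames)
    return total / len(frames)
```

Dividing by the number of frames keeps scores comparable across utterances of different lengths, which matters when a single threshold is later applied to all of them.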

The  scores  were  calculated  using  HTK’s  HVite  command.  HVite  presented  each 
utterance’s  MFCCs  separately  to  each  of  the  168  or  630  models  and  saved  the  scores. 
Varying  the  appropriate  parameters  in  uttscores2.c  obtained  the  results  for  all  630  speakers. 
The  script  was  used  to  generate  scores  for  both  SA  utterances  as  well  as  the  last  two  SX 
utterances  from  each  speaker.  The  scores  from  the  SA  utterances  were  used  to  determine 
initial  thresholds  for  verification  results  with  a  minimal  equal  error  rate.  The  scores  from 
the  SX  utterances  were  used  to  determine  the  system’s  success  with  the  SA  thresholds  as 
well as to determine an equal error rate of their own.

3.8  Speaker  Identification 

Speaker  identification  required  using  MATLAB  to  determine  which  GMM  generated 
the  highest  log-likelihood  score  for  a  given  utterance  by  reading  in  the  files  created  by 
uttscores2.c  (see  Appendix  B).  If  the  index  of  the  highest  score  corresponded  to  the  index 
of  the  utterance’s  speaker,  the  trial  was  considered  to  have  correctly  identified  the  speaker. 
Otherwise, the trial was considered to have made an error in speaker identification.
Winning models were chosen according to

\hat{\imath} = \arg\max_{i} \left\{ \log\left[ \mathcal{L}(U \mid \lambda_i) \right] \right\}. \qquad (3.5)

This  process  was  repeated  for  all  speakers  and  their  corresponding  SA  utterances  and 
separately  for  SX  testing  utterances  using  the  m-file  spkrid.m  (see  Appendix  B)  with  the 
results  from  each  iteration  saved  to  separate  files. 
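
The decision rule and the error tally can be sketched in Python as follows. This is not the spkrid.m code. The function names are hypothetical, and each row of score_rows is assumed to hold the log-likelihood scores of one utterance against every model, in model order.

```python
def identify(scores):
    """Return the index of the model with the highest log-likelihood score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def identification_results(score_rows, true_indices):
    """Count errors and correct trials and convert both to percentages."""
    errors = sum(
        1 for row, truth in zip(score_rows, true_indices) if identify(row) != truth
    )
    trials = len(true_indices)
    correct = trials - errors
    return errors, correct, 100.0 * errors / trials, 100.0 * correct / trials
```

With 336 trials, three errors correspond to 0.893% errors and 99.1% correct identification.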
3.9 Speaker Verification


3.9.1 Cohort Selection. Having already calculated the probability scores for each
speaker’s SA1 and SA2 utterances for all GMMs under consideration, determining the 10
cohorts was straightforward. (Note: this differs slightly from [11], in which cohorts were
determined by obtaining a probability score from the concatenation of all the training
MFCCs.) First, a speaker i and a different speaker j were chosen and a distortion score
was determined according to Equation 3.6,

d_{sym}(\lambda_i, \lambda_j) = \log \frac{\mathcal{L}(U_i \mid \lambda_i)}{\mathcal{L}(U_i \mid \lambda_j)} + \log \frac{\mathcal{L}(U_j \mid \lambda_j)}{\mathcal{L}(U_j \mid \lambda_i)} \quad \text{for } i \neq j, \qquad (3.6)

where U_i is speaker i’s SA1 utterance, λ_x is speaker x’s GMM, and U_j is speaker j’s SA2
utterance.
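
Because the scores are log-likelihoods, each log ratio in the distortion score is a difference of two stored scores. The Python sketch below follows the same arithmetic as the distmtrx.m listing in Appendix B. The function name is hypothetical, and both inputs are assumed to be square, with rows indexed by the speaker of the utterance and columns by the speaker of the model.

```python
def symmetric_distortion(sa1_scores, sa2_scores):
    """Distortion matrix d[i][j] built from two log-likelihood score matrices.

    sa1_scores[i][k]: score of speaker i's SA1 utterance against model k.
    sa2_scores[j][k]: score of speaker j's SA2 utterance against model k.
    """
    n = len(sa1_scores)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            own_vs_other = sa1_scores[i][i] - sa1_scores[i][j]
            other_vs_own = sa2_scores[j][j] - sa2_scores[j][i]
            d[i][j] = own_vs_other + other_vs_own
    return d
```

The diagonal of the result is zero by construction. Since the two terms use different utterances, d[i][j] and d[j][i] need not be equal.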


These scores were then sorted in ascending order. The five maximally spread close
cohorts were the speakers whose distortion scores were closest to zero and who were,
therefore, believed to sound most like the speaker, but who were not duplicates of one
another (so their scores were not necessarily adjacent to one another in the sorted list).
The five maximally spread far cohorts were those whose distortion scores were furthest
from zero and who were, therefore, believed to sound least like the speaker, but who were
likewise not duplicates (so their distortion scores were not necessarily adjacent to one
another). A detailed algorithm for cohort selection can be found in Appendix A.
The  process  was  automated  to  find  cohorts  for  all  speakers  using  the  m-file  maincohort.m 
(see  Appendix  B).  Separate  cohort  sets  were  found  for  the  straight,  CMSBL,  and  modified 
Wiener  filter  methods. 


3.9.2 Verification Scores. After cohorts were determined for each speaker, verification
scores were calculated for each speaker’s test SX utterance where each of the
registered speakers posed as the claimant. Verification scores were calculated according to

v(U) = \log\left[ \mathcal{L}(U \mid \lambda_c) \right] - \frac{1}{B} \sum_{s \in \Omega} \log\left[ \mathcal{L}(U \mid \lambda_s) \right], \qquad (3.7)

where U is the utterance, Ω is the claimant’s cohort set, B is the number of cohorts in
the set, λ_x is the appropriate GMM for speaker x, and x ∈ {{c} ∪ {s | s ∈ Ω}}. While
verification scores were calculated for all cases, only the results for testing without cohorts
were reported for consistency with [11].
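
A cohort-normalized score of this kind can be sketched in Python as the claimant's log-likelihood minus the mean cohort log-likelihood. The function names are hypothetical. The acceptance rule (accept when the score is at or above the threshold) is an assumption of the sketch rather than something stated in the text.

```python
def verification_score(claimant_loglik, cohort_logliks):
    """Claimant log-likelihood minus the average cohort log-likelihood."""
    if not cohort_logliks:
        raise ValueError("the cohort set must contain at least one score")
    background = sum(cohort_logliks) / len(cohort_logliks)
    return claimant_loglik - background


def accept(score, threshold):
    """Assumed decision rule: accept the claim when the score reaches the threshold."""
    return score >= threshold
```

A positive score means the utterance fits the claimed speaker's model better than it fits the cohort models on average.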

3.10 Equal Error Rate

Before  determining  the  equal  error  rate  (EER),  the  verification  scores  for  all  of  the 
tested  SX  utterances  for  each  of  the  speakers  were  sorted  in  ascending  order.  An  initial 
arbitrary  threshold  was  chosen  and  the  number  of  false  accepts  and  false  rejects  determined. 
The  rates  for  false  accepts  and  false  rejects  were  calculated  and  the  EER  assigned  as  the 
average  of  the  false  accept  and  false  reject  rates.  The  final  EER  was  determined  by 
finding  the  threshold  for  which  the  difference  between  false  accept  and  false  reject  rates 
was  minimum. 
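
The threshold search described above can be sketched in Python. Instead of starting from an arbitrary threshold, this version simply tries every observed score as a candidate, which is an assumption made for brevity. It also assumes that a claim is accepted when its score is at or above the threshold. The function name and the return layout are hypothetical.

```python
def equal_error_rate(true_scores, impostor_scores):
    """Return (threshold, eer, false_accepts, false_rejects).

    The EER is reported as the average of the false accept and false reject
    rates at the threshold where the two rates are closest to each other.
    """
    if not true_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for threshold in sorted(set(true_scores) | set(impostor_scores)):
        fa = sum(1 for s in impostor_scores if s >= threshold)
        fr = sum(1 for s in true_scores if s < threshold)
        fa_rate = fa / len(impostor_scores)
        fr_rate = fr / len(true_scores)
        gap = abs(fa_rate - fr_rate)
        if best is None or gap < best[0]:
            best = (gap, threshold, (fa_rate + fr_rate) / 2.0, fa, fr)
    return best[1], best[2], best[3], best[4]
```

When the two rates cannot be made exactly equal, the averaged value can hide a large imbalance between false accepts and false rejects, so the individual rates should be inspected as well.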

3.11  Summary 

This  chapter  has  presented  the  approach  and  methods  used  in  this  thesis.  A  brief 
listing  of  the  computer  hardware,  software,  and  databases  used  was  provided.  Discussions 
on  how  MFCCs  were  generated  and  on  the  procedures  used  in  building  and  training  a 
GMM  were  provided.  Additionally,  the  methods  used  for  speaker  identification  and  speaker 
verification,  including  the  selection  of  cohorts,  were  also  given.  Finally,  the  method  used 
for calculating the EER was discussed.

IV. Experimental Results and Analysis


4.1 Introduction

The  results  from  the  various  trials  are  presented  in  terms  of  process,  the  corpus, 
and  the  method  used.  The  first  section  covers  the  speaker  ID  results  for  both  databases 
and  each  method.  Then  the  speaker  verification  results  are  presented.  Lastly,  the  first 
known  speaker  ID  and  verification  results  for  the  entire  630  speakers  in  straight  TIMIT 
and  NTIMIT  are  presented.  For  each  process,  training  was  done  only  on  TIMIT  utterances. 


4.2  Identification 

4.2.1  TIMIT.  Tables  4.1,  4.2,  and  4.3  give  the  results  from  testing  against  the 
168  test  speakers  of  TIMIT  using  the  SA  and  SX2  utterances  from  TIMIT.  Of  the  three 
methods  (straight,  CMSBL,  and  modified  Wiener  filtering),  the  straight  method  had  the 
best  speaker  ID  results  for  the  test  SX  utterances  with  100%  accuracy  for  the  168  speakers. 
The  CMSBL  approach  was  a  close  second  making  only  three  errors  on  the  test  utterances 
for  99.1%  correct  identification.  The  modified  Wiener  approach  performed  the  worst  with 
84.2%  correct  identification.  The  decrease  in  correct  identification  by  the  CMSBL  approach 
supports  Reynolds’  assertion  that  cepstral  mean  subtraction  will  reduce  performance  when 
training  and  testing  conditions  are  the  same  [12]. 


Table 4.1 Straight TIMIT Identification for 168 speakers

Utterance      # Errors   # Correct   Errors %   Correct %
SA1 and SA2    0          336         0          100
test SX        0          336         0          100


Table 4.2 CMSBL TIMIT Identification for 168 speakers

Utterance      # Errors   # Correct   Errors %   Correct %
SA1 and SA2    0          336         0          100
test SX        3          333         0.893      99.1


Table 4.3 Modified Wiener filter TIMIT Identification for 168 speakers

Utterance      # Errors   # Correct   Errors %   Correct %
SA1 and SA2    47         289         14.0       86.0
test SX        53         283         15.8       84.2

4.2.2 NTIMIT. Table 4.4 illustrates the results from testing the NTIMIT SX2
utterances against the 168 test speakers. Again, the GMMs were trained on TIMIT
utterances and tested against the NTIMIT ones, providing tests done under channel
mismatch conditions. This time, the CMSBL method performed the best of the three
methods with 27.4% correct. The modified Wiener filter method was a distant second with only
4.17% correct and the straight method was third with 3.57%.


Table 4.4 Straight NTIMIT Identification for 168 speakers

Utterance      # Errors   # Correct   Errors %   Correct %
Straight       324        12          96.4       3.57
CMSBL          244        92          72.6       27.4
Mod W          322        14          95.8       4.17


4.3 Verification

4.3.1 TIMIT. Tables 4.5, 4.6, and 4.7 illustrate the results from testing the
TIMIT SX2 utterances against the 168 TIMIT test speakers. The verification EER for the
TIMIT test utterances was extremely close (within one false reject) to Reynolds’ EER of
approximately 0.24% [14]. This discrepancy is likely the result of the subtle difference in
cohort selection. The CMSBL method produced the best results with the straight method a
close second and the modified Wiener method a distant third. In Tables 4.5, 4.6, and 4.7,
a * indicates misleading EERs due to averaging.


Table 4.5 Straight TIMIT Verification for 168 speakers

Utterance      Threshold     EER %     # fa    # fr   FA %          FR %
SA1 and SA2    10.375        0.576     148     2      0.557         0.595
test SX        [illegible]   0.504 *   375     1      [illegible]   [illegible]
test SX        [illegible]   0.595     314     2      [illegible]   [illegible]


In these verification tables, the first test SX threshold is taken from the threshold
obtained for the training SA scores, as would be done in a real-world system. The
second test SX threshold was determined by allowing the system to find the EER and
threshold for the test SX utterances. While this second test SX threshold could not be
obtained in a practical system, the result provides a means of gauging how well the system
might have done with either additional training utterances to determine the threshold or
in a text-dependent scheme. For the straight and CMSBL methods, the “better” EERs
increased the number of false rejects by one, while the number of false accepts was reduced.
For the modified Wiener filter method, the new EER has similar effects but with more
drastic results. The number of false rejects decreased from 35 to 27 while the number of
false accepts rose from 3,758 to 4,313.

Table 4.6 CMSBL TIMIT Verification for 168 speakers

Utterance      Threshold     EER %         # fa          # fr          FA %          FR %
SA1 and SA2    6.527         0.338         200           1             0.379         0.298
test SX        [illegible]   [illegible]   [illegible]   0             [illegible]   [illegible]
test SX        [illegible]   [illegible]   [illegible]   [illegible]   [illegible]   [illegible]

Table 4.7 Modified Wiener filter TIMIT Verification for 168 speakers

Utterance      Threshold     EER %   # fa    # fr   FA %          FR %
SA1 and SA2    4.5081        7.85    4,202   26     7.97          7.74
test SX        [illegible]   8.77    3,758   35     [illegible]   [illegible]
test SX        4.2172        8.11    4,313   27     [illegible]   [illegible]


4.3.2 NTIMIT. Tables 4.8, 4.9, and 4.10 give the results of testing the NTIMIT
SX2 utterances against the 168 test speakers of TIMIT under mismatched testing and
training conditions. The results were similar to those for TIMIT. Again, the CMSBL
method produced the best results with an EER of 19.6%, almost half that of either the
straight method’s 37.1% or the modified Wiener filter’s 39.0%. As in the previous tables,
a * indicates misleading EERs due to averaging and a - indicates methods that are
not practical since most real users are kept out. For the NTIMIT verification tables,
the first test SX threshold is taken from the threshold obtained for the training TIMIT SA
scores, as would be done in a real-world system. The second test SX threshold
was determined from the threshold for the lowest EER from the TIMIT test SX utterances.
The final test SX threshold was determined by allowing the system to find a more accurate
EER and threshold for the NTIMIT test SX utterances.

Table 4.8 Straight NTIMIT Verification for 168 speakers

Utterance   Threshold     EER %         # fa     # fr   FA %          FR %
test SX     [illegible]   [illegible]   21       334    [illegible]   [illegible]
test SX     [illegible]   [illegible]   18       334    [illegible]   [illegible]
test SX     1.19          37.1          19,686   124    37.3          36.9

Table 4.9 CMSBL NTIMIT Verification for 168 speakers

Utterance   Threshold     EER %    # fa     # fr   FA %     FR %
test SX     [illegible]   49.1 *   13       330    0.0246   98.2
test SX     [illegible]   49.7     7        334    0.0133   99.4
test SX     1.237         19.6     10,359   66     19.6     19.6

Table 4.10 Modified Wiener filter NTIMIT Verification for 168 speakers

Utterance   Threshold   EER %    # fa     # fr   FA %   FR %
test SX     4.5081 -    46.8 *   528      311    1.00   92.6
test SX     4.2172 -    46.2 *   751      306    1.42   91.1
test SX     0.6549      39.0     20,566   131    39.0   39.0



4.4 Entire Corpus

Tables 4.11 and 4.12 give the results for speaker identification and speaker verification,
respectively, of testing all 630 speakers of TIMIT using utterances from TIMIT
and  NTIMIT.  Table  4.11  illustrates  that,  even  for  the  entire  630  speaker  set,  the  straight 
method  yields  excellent  results  for  matched  training  and  testing  conditions.  When  testing 
with  NTIMIT  utterances,  however,  the  result  is  almost  the  inverse  with  only  10  correct 
identifications.  These  results  emphasize  the  need  for  additional  processing  of  the  utterances 
prior  to  identification  under  channel  mismatch  conditions. 


Table 4.11 Straight Identification for 630 speakers

Corpus   Utterance      # Errors   # Correct     Errors %      Correct %
TIMIT    SA1 and SA2    0          [illegible]   [illegible]   100
TIMIT    test SX        1          [illegible]   [illegible]   99.9
NTIMIT   test SX        1250       10            99.2          0.794


Large  population  speaker  verification  results  are  given  in  Table  4.12.  For  the  TIMIT 
SX  testing  utterances,  the  first  threshold  (10.8959)  was  determined  from  the  SA  utterances. 
The  second  threshold  (10.8682)  yielded  the  best  EER  for  the  TIMIT  SX  testing  utterances. 
For  the  NTIMIT  SX  testing  utterances,  the  three  thresholds  represent  the  TIMIT  SA,  the 
TIMIT testing SX, and NTIMIT testing SX error rates.

Table 4.12 Straight Verification for 630 speakers

Corpus   Utterance      Threshold   EER %    # fa          # fr    FA %          FR %
TIMIT    SA1 and SA2    10.8959     0.510    [illegible]   6       0.544         0.476
TIMIT    test SX        10.8959     0.666    4,813         9       [illegible]   [illegible]
TIMIT    test SX        10.8682     0.596    4,885         7       [illegible]   [illegible]
NTIMIT   test SX        10.8959 -   49.8 *   439           1,255   0.0563        99.6
NTIMIT   test SX        10.8682 -   49.8 *   449           1,255   0.0576        99.6
NTIMIT   test SX        1.267       38.7     302,071       488     38.7          38.7


Despite  increasing  the  number  of  speakers  used  by  Reynolds  from  168  to  630,  the 
EER  for  the  TIMIT  SX  case  changed  by  0.16%  for  the  SA  threshold  and  only  0.04%  for 
the second test SX threshold. The difference in thresholds across mismatched channel conditions
in Table 4.12 illustrates just how difficult the problem was. Mismatched conditions
not  only  influenced  the  error  rates  for  a  given  threshold  but  dramatically  affected  the  EER 
threshold  as  well. 


4.5  Summary 

In  this  chapter,  Reynolds’  results  for  speaker  identification  were  reproduced  for 
straight  TIMIT.  His  results  for  speaker  verification  using  only  the  168  test  speakers  from 
TIMIT  were  also  reproduced  to  within  one  false  rejection.  The  CMSBL  method  performed 
almost  as  well  as  the  straight  method  when  applied  to  TIMIT  utterances  and  better  than 
the  straight  method  when  applied  to  NTIMIT  utterances.  The  modified  Wiener  filtering 
method was found to be significantly less effective for speaker identification of TIMIT utterances
and only slightly better than the straight method for NTIMIT utterances. The
modified  Wiener  filtering  method  performed  significantly  worse  than  CMSBL  for  both 
TIMIT  and  NTIMIT  testing  conditions.  For  mismatched  channel  conditions,  the  results 
for  speaker  verification  were  similar  to  speaker  identification  with  CMSBL  doing  the  best. 
The straight method did an excellent job of speaker ID for the large (630) speaker populations
when testing on TIMIT utterances with 99.9% correct. However, the results for
NTIMIT, only 0.794% correct, indicate the need for some form of preprocessing under
mismatched channel conditions.

V. Conclusions


5.1 Conclusions

This thesis reproduced previously obtained results for speaker identification and verification
using MFCCs and GMMs for a population size of 168 speakers. The previous
work was then extended to a larger population of 630 speakers. A modified Wiener filtering
approach was used in an attempt to minimize channel mismatch effects on speaker
verification.  It  was  found,  however,  that  this  approach  yielded  worse  performance  by  a 
factor  of  two  than  traditional  cepstral  mean  subtraction  with  bandlimiting.  The  modified 
Wiener  approach  was  2%  worse  than  using  straight  models  with  unfiltered  speech.  Three 
conclusions  can  be  drawn  from  these  results.  First,  the  results  for  all  630  speakers  for 
matched  training  and  testing  conditions  imply  that  there  is  ample  room  in  the  23  MFCC 
feature  space  for  speaker  ID  and  verification.  The  channel  mismatch  conditions,  however, 
require  some  sort  of  preprocessing  prior  to  speaker  ID  or  verification.  Second,  while  the 
modified  Wiener  filtering  method  looked  promising  from  a  mathematical  standpoint,  no 
filtering  of  an  utterance  yields  better  results  at  a  lower  computational  cost.  Lastly,  CMSBL 
yields  the  best  verification  results  for  matched  or  mismatched  conditions  with  only  a  minor 
degradation  in  speaker  ID  under  matched  conditions. 

5.2  Future  Study 

Suggestions  for  future  research  include  the  following: 

•  Convolve  the  inverse  DFT  of  the  speaker’s  average  spectra  with  the  original  utterance. 
This  method  was  found  serendipitously  and  the  results  sounded  subjectively  better 
than the original TIMIT utterances. This is along the lines of Stockham’s efforts [17].

• Try Avendano and Hermansky’s method of channel normalization [3].

•  Create  versions  of  YOHO  similar  to  NTIMIT  and  CTIMIT  to  test  channel  effects  on 
a  standard  speaker  verification  corpus  instead  of  a  speech  recognition  corpus. 

• Investigate the use of speaker ID results to improve speaker verification results.

• Investigate the use of speaker-dependent thresholds instead of using a global threshold
for speaker verification.

•  Continue  the  search  for  feature  vectors  that  are  more  robust  to  channel  effects  than 
MFCCs.

Appendix A. Cohorts


A.1 Calculating Symmetric Distortion Scores

Cohorts  are  chosen  from  among  all  of  the  registered  speakers  as  those  that  appear 
most  and  least  like  the  claimant.  This  algorithm  is  based  on  the  one  given  by  Reynolds  [14]. 
To  select  a  speaker’s  cohorts,  one  utterance  from  the  speaker  and  one  utterance  from  each 
of  the  other  registered  speakers  are  required.  Instead  of  a  single  utterance  from  a  speaker, 
a  concatenation  of  all  the  speaker’s  training  utterances  may  also  be  used.  A  symmetric 
distortion  score  is  calculated  for  a  given  speaker  i  and  one  of  the  other  registered  speakers 
j  as 

d_{sym}(\lambda_i, \lambda_j) = \log \frac{p(U_i \mid \lambda_i)}{p(U_i \mid \lambda_j)} + \log \frac{p(U_j \mid \lambda_j)}{p(U_j \mid \lambda_i)} \quad \text{for } i \neq j, \qquad (A.1)

where U_x and λ_x are utterances by and speaker models for speaker x, respectively. In
this way, a distortion score is calculated for the given speaker i and each of the remaining
registered speakers. The resulting scores are then sorted in ascending order.

A.2 Selecting Cohorts

Set B to the total number of desired cohort speakers to be selected, where B is even
when choosing both close and far cohorts. For selecting both close and far cohorts, set
S = B/2. For selecting only close or only far cohorts, set S = B. To avoid having cohorts
that are extremely similar, a maximally spread algorithm is used per Reynolds [11].

A.2.1 Choosing Close Cohorts. For a given speaker i, this process chooses the
registered speakers that “sound” most like the given speaker. Choose Ntot speakers with
the smallest distortion scores to create a pool of potential close cohorts PC_i for speaker
i. The potential close cohorts should exclude speaker i, and Ntot should be chosen large
enough such that Ntot > S.

Step 0: Move the closest speaker (i.e., the one with the smallest distortion score) from
PC_i to C_i, the final set of close cohort speakers for speaker i.

Set C' = 1, where C' is the number of cohort speakers already selected for speaker i.

Step 1: Move speaker \hat{c} from PC_i to C_i where

\hat{c} = \arg\max_{c \in PC_i} \left\{ \frac{1}{C'} \sum_{b \in C_i} \frac{d(\lambda_b, \lambda_c)}{d(\lambda_i, \lambda_c)} \right\}. \qquad (A.2)

Set C' = C' + 1.

Step 2: Repeat Step 1 until C' = S.


A.2.2 Choosing Far Cohorts. For a given speaker i, this process chooses the
registered speakers that “sound” least like the given speaker. First, choose the Ntot > S
speakers, excluding speaker i, who had the largest distortion scores in order to create a
pool of potential far cohorts, PF_i.

Step 0: Move the furthest speaker (i.e., the one with the largest distortion score) from
PF_i to F_i, the final set of far cohort speakers.

Set F' = 1, where F' is the number of far cohort speakers already selected.

Step 1: Move speaker \hat{f} from PF_i to F_i where

\hat{f} = \arg\max_{f \in PF_i} \left\{ \frac{1}{F'} \sum_{b \in F_i} d(\lambda_b, \lambda_f) \times d(\lambda_i, \lambda_f) \right\}. \qquad (A.3)

Let F' = F' + 1.

Step 2: Repeat Step 1 until F' = S.

When both close and far cohorts are desired, create a total cohort set, Ω_i, for each
speaker i using both sets of cohorts so that

\Omega_i = \{ C_i \cup F_i \}. \qquad (A.4)

If only close cohorts are desired, use Ω_i = {C_i}. Calculating symmetric distortion scores
and selecting cohorts is then repeated for all of the registered speakers.
Appendix B. C-Shell Scripts and MATLAB m-files


B.1 m-files¹


¹ For printing purposes the m-files were concatenated into a single file. The line numbering in the right
margin of the original listing refers to the concatenated file.

%
% detvoice.m
%
% function vuv=detvoice(curphon,phndb);
%
% Description: This function determines whether the phoneme interval containing
%              the current frame of speech is voiced or unvoiced.
%
% Author: Capt Al Arb, USAF
% Date: 30 Jul 96
% Modified:
%
% Input parameters:
%   curphon: The phoneme label for the current frame of speech.
%   phndb:   The TIMIT phoneme data base matrix.
%
% Output Parameter:
%   vuv: voiced/unvoiced. 1=voiced, 0=unvoiced.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%
function vuv=detvoice(curphon,phndb);
vuv=0;
count=0;
done=0;
%
% Loop until we find the label
%
while ~done
  count=count+1;
  %
  % If the DB entry = curphon, we found it.
  %
  if phndb(count,1:4)==curphon
    done=1;
    phn=count;
  %
  % Or if we are at the end of the file, stop and assume unvoiced.
  %
  elseif count==length(phndb(:,1))
    done=1;
    phn=0;
  end;
end;
if phn
  %
  % If the category is VOICED, set vuv to 1.
  %
  if phndb(phn,5:10)=='VOICED'
    vuv=1;
  end;
end;

%
% distmtrx.m
%
% [dscores] = distmtrx(sa1matrix, sa2matrix)
%
% This function calculates the distortion metric scores for a given
% speaker when provided the proper inputs.
%
% Key assumptions: Scores are log-likelihoods
%                  Matrices are square
%                  Rows indicate utterance from a given speaker
%                  Columns indicate model for a given speaker
%
% Input:
%   sa1matrix  matrix of scores for each model for one utterance
%
%   sa2matrix  matrix of scores for each model for second utterance
%
% Output:
%   dscores    a matrix of distortion metric scores
%
% Created by Capt R. Brian Reid
% Date: 8 Aug 1997
%
% References: Columbi et al. “Allowing Good Imposters to Test”
%             based on Reynolds
%
% Last modified: 14 Aug 1997
%                1005
%
function [dscores] = distmtrx(sa1matrix, sa2matrix)
numspkrs = size(sa1matrix,2);
dscores = zeros( numspkrs );
for iloop = 1 : numspkrs,
  for jloop = 1 : numspkrs,
    % Log ratios become differences because the scores are log-likelihoods.
    dscores( iloop, jloop) = sa1matrix(iloop, iloop) - sa1matrix(iloop, jloop) + ...
                             sa2matrix(jloop, jloop) - sa2matrix(jloop, iloop);
  end;
end;


%
% emailmsg
%
% emailmsg(emailaddress,message);
%
% Used to send email message to user in UNIX OS.
% by: Capt. Edward M. Ochoa, GEO-96D
function emailmsg(emailaddress,message);
UCMD1=sprintf('echo "%s" > /tmp/emailmsg.txt',message);
UCMD2=sprintf('echo "%s %s" >> /tmp/emailmsg.txt',message);
UCMD3=sprintf('echo >> /tmp/emailmsg.txt');
UCMD4=sprintf('mail %s < /tmp/emailmsg.txt',emailaddress);
UCMD5=sprintf('!rm /tmp/emailmsg.txt');
UCMDS=['! ' UCMD1 UCMD2 UCMD3 UCMD4];
eval(UCMDS)
eval(UCMD5)


%
% exvu4htk.m
%
% [voicedspeech, unvoiced] = exvu4htk(data,uttfile,fs,wlength,fstep,phnfile);
%
% Inputs:
%   data     actual waveform data (byte swap if necessary)
%   uttfile  file name of utterance to use
%   fs       sample frequency in Hz
%   wlength  window size in seconds
%   fstep    frame step size in seconds
%   phnfile  name of file (*.phn) with phoneme labels
%            useful if phoneme labels are in a separate directory
%
% Derived from exvoiced4htk.m code by Capt Al Harb
% Modified by Capt R. Brian Reid
% 19 Jul 1997
%
function [voicedspeech, unvoicedspeech] = exvu4htk(data, uttfile, fs, wlength, fstep, phnfile);
extind=find(uttfile=='.');
% Default to a .phn file next to the utterance when no label file is given.
if isempty(phnfile)
  phnfile = [uttfile(1:extind) 'phn'];
end;
% phnfile
%[phnind,phnval]=readphn(setstr([uttfile(1:extind) 'phn']));
[phnind,phnval]=readphn(setstr([phnfile]));
phndb=loadphndb;
wstart=1;
wend=wlength*fs;
finished=0;
voicedspeech=[];
unvoicedspeech=[];
while ~finished,
  % Find the phoneme interval that contains the end of the current window.
  cur_phoneme=find((phnind(:,1)<=wend)&(phnind(:,2)>wend));
  vuv=detvoice(phnval(cur_phoneme,1:4),phndb);
  if vuv,
    voicedspeech=[voicedspeech; data(wstart:wend)];
  else
    unvoicedspeech = [unvoicedspeech; data(wstart:wend)];
  end;
  wstart=wstart+(fstep*fs);
  wend=wstart-1+(wlength*fs);
  if wend > length(data),
    finished=1;
  end;
end;

%
% fftavgf.m is a function to calculate the average FFT or just the FFT of
% a segment of a speech file
%
% [avgfft] = fftavgf(data, fftsize, samplefrequency, windowsize)
%
% Inputs:
%   data             Actual data
%   fftsize          Number of points in the FFT
%   samplefrequency  sampling frequency data was obtained at
%   windowsize       time in seconds of the window of speech
%
% Output:
%   avgfft           Average spectra for the input
%
function [avgfft] = fftavgf(data, fftsize, samplefrequency, windowsize)

% Read in the utterance
% [data, numsamples] = readhtkn(utterance,0);
numsamples = max(size(data));

% Determine the number of samples per chunk of time
timechunks = samplefrequency * windowsize;

% Initialize the average to zero
avgfft = zeros(fftsize,1);
runsum = zeros(fftsize,1);
pointer = 1;

% Incrementally calculate the FFT
numchunks = floor(numsamples / timechunks);
for loop = 1 : numchunks
    temp = abs( fft( data( pointer : pointer - 1 + timechunks ), fftsize ) );
    pointer = pointer + timechunks;
    % Update the running sum
    runsum = runsum + temp;
end;

% Add the spectrum of any leftover samples, padded with zeros to fftsize
if numchunks ~= numsamples / timechunks
    temp = abs( fft( data( pointer : length(data) ) ) );
    runsum = runsum + [temp; zeros(fftsize - length(temp), 1) ];
    loop = loop + 1;
end;

% Compute the average as running sum / # of chunks
avgfft = runsum / loop;
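
The averaging performed by fftavgf can be sketched in Python as follows. This is an illustration only, not the thesis code: it uses a naive DFT from the standard library instead of an FFT routine, the names dft_magnitude and average_spectrum are hypothetical, and, as a simplifying assumption, any leftover samples after the last full window are dropped instead of being added as an extra chunk.

```python
import cmath


def dft_magnitude(frame, nfft):
    """Magnitude of the nfft-point DFT of frame (zero-padded or truncated)."""
    padded = (list(frame) + [0.0] * nfft)[:nfft]
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / nfft)
                for n, x in enumerate(padded)))
        for k in range(nfft)
    ]


def average_spectrum(data, nfft, sample_rate, window_seconds):
    """Average magnitude spectrum over consecutive non-overlapping windows."""
    chunk = int(round(sample_rate * window_seconds))
    nchunks = len(data) // chunk
    if nchunks == 0:
        raise ValueError("data is shorter than one window")
    total = [0.0] * nfft
    for i in range(nchunks):
        mag = dft_magnitude(data[i * chunk:(i + 1) * chunk], nfft)
        total = [t + m for t, m in zip(total, mag)]
    # Assumption of this sketch: trailing samples short of a window are ignored
    return [t / nchunks for t in total]
```

With the thesis settings (16 kHz, 20 ms windows, 512-point transform) each window holds 320 samples, so every frame is zero-padded before the transform.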


%
% fndcchrt.m
%
% [cohorts] = fndcchrt(distortionmatrix, N, B, spkrnum);
%
% fndcchrt finds the close cohorts for a given distortion matrix.
%
% Inputs:
%   distortionmatrix  a square matrix of distortion scores
%   N                 size of pools to use in selecting background speakers
%   B                 size of background speakers for maximally spread close
%   spkrnum           indicates number of speaker in distortionmatrix
%
% Output:
%   Bi                vector of pointers to speakers determined to be cohorts
%
% References:
%   Reynolds, "Speaker identification and verification using
%   Gaussian mixture speaker models," Speech Communication
%   17 (1995) p 91-108
%
% Created by Capt R. Brian Reid
%
% Created on 14 Aug 1997
%
% Last Modified: 26 Aug 1997
% 1138
%

function Bi = fndcchrt(matrix, N, B, spkrnum);

Nc = N;
Nf = N;
numspkrs = size(matrix,2);

% Setup
% Find the set of maximally spread close speakers for spkrnum

% Select spkrnum's utterance against all models
Dv = matrix(spkrnum, : );

% Create an index of all of the speakers
indexptr = linspace( 1, numspkrs, numspkrs);

% Remove current speaker from the list
Dv = nixcol(Dv, spkrnum);
indexptr = nixcol(indexptr, spkrnum);

% Sort Dv and get an index back
[Dvnew, index] = sort(Dv);

% Reorganize indexptr according to index
indexptrnew = indexptr(index);

% Select N closest speakers (speakers with smallest distortion)
Ci = indexptrnew(1:Nc);

%
% Step 0: Move the closest speaker from Ci to Bi
%
Bi = Ci(1);
Ci = nixcol(Ci, 1 );
Nc = Nc - 1;
Bprime = 1;

%
% Step 1: Move speaker c from Ci to Bi until Bprime = B
%
while Bprime < B,
    tmp2 = [0];
    for cloop = 1 : length(Ci),
        tmp = [0];
        for bloop = 1 : length(Bi),
            tmp(bloop) = matrix( Bi(bloop), Ci(cloop) ) - matrix( spkrnum, ...
                Ci(cloop) );
        end; % for bloop = 1 : length(Bi)
        tmp2(cloop) = sum(tmp) / Bprime;
    end; % for cloop = 1 : length(Ci),
    Bprime = Bprime + 1;
    Nc = Nc - 1;

    % Find the largest c and move from Ci to Bi
    [tmp3, tmpindex] = sort(tmp2);
    Bi(Bprime) = Ci( tmpindex( length(tmpindex) ) );
    Ci = nixcol( Ci, tmpindex( length(tmpindex) ) );

    % Get another background speaker from Ci until Bprime = B
end; % while Bprime < B,
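
The greedy selection in fndcchrt may be easier to follow in a compact form. The Python sketch below is an illustration under stated assumptions, not the thesis implementation: the function name close_cohorts is hypothetical, speakers are numbered from 0, and the spread criterion mirrors the difference of distortions as it appears in the listing above. Ties are broken by pool order here, which may differ from the MATLAB sort.

```python
def close_cohorts(dist, pool_size, num_background, speaker):
    """Pick num_background close cohorts for speaker from a distortion matrix.

    dist is a square list of lists; dist[i][j] is the distortion between
    speakers i and j.
    """
    others = [j for j in range(len(dist)) if j != speaker]
    # Pool: the pool_size speakers with the smallest distortion to the claimant
    pool = sorted(others, key=lambda j: dist[speaker][j])[:pool_size]
    # Step 0: the closest speaker seeds the background set
    background = [pool.pop(0)]
    # Step 1: repeatedly move the candidate with the largest average spread
    while len(background) < num_background and pool:
        def spread(c):
            total = sum(dist[b][c] - dist[speaker][c] for b in background)
            return total / len(background)
        best = max(pool, key=spread)
        pool.remove(best)
        background.append(best)
    return background
```

For example, with four speakers and a pool of three, the call close_cohorts(dist, 3, 2, 0) returns the closest speaker to speaker 0 followed by the pooled speaker that is most spread from it.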

%
% fndfchrt.m
%
% [cohorts] = fndfchrt(distortionmatrix, N, B, spkrnum);
%
% fndfchrt finds the far cohorts for a given distortion matrix.
%
% Inputs:
%   distortionmatrix  a square matrix of distortion scores
%   N                 size of pools to use in selecting background speakers
%   B                 size of background speakers for maximally spread close
%   spkrnum           indicates number of speaker in distortionmatrix
%
% Output:
%   Bi                vector of pointers to speakers determined to be cohorts
%
% References:
%   Reynolds, "Speaker identification and verification using
%   Gaussian mixture speaker models," Speech Communication
%   17 (1995) p 91-108
%
% Created by Capt R. Brian Reid
%
% Created on 14 Aug 1997
%
% Last Modified: 26 Aug 1997
% 1138
%

function Bi = fndfchrt(matrix, N, B, spkrnum);

Nc = N;
Nf = N;
numspkrs = size(matrix,2);

% Select spkrnum's utterance against all models
Dv = matrix(spkrnum, : );

% Create an index of all of the speakers
indexptr = linspace( 1, numspkrs, numspkrs);

% Remove current speaker from the list
Dv = nixcol(Dv, spkrnum);
indexptr = nixcol(indexptr, spkrnum);

% Sort Dv and get an index back
[Dvnew, index] = sort(Dv);

% Reorganize indexptr according to index
indexptrnew = indexptr(index);

% Select N further speakers (speakers with greatest distortion)
bsize = length(Bi);
Fi = [ indexptrnew( length( indexptrnew ) - Nf : length( indexptrnew ) ) ];

%
% Step 0: Move the furthest speaker from Fi to Bi
%
Bprime = 1;
Bi(1) = Fi(Nf);
Fi = nixcol(Fi, Nf );
Nf = Nf - 1;

%
% Step 1: Move speaker f from Fi to Bi until Bprime = B
%
while Bprime < B,
    tmp2 = [0];
    for cloop = 1 : length(Fi),
        tmp = [0];
        for bloop = 1 : length(Bi),
            tmp(bloop) = matrix( Bi(bloop), Fi(cloop) ) * matrix( spkrnum, ...
                Fi(cloop) );
        end; % for bloop = 1 : length(Bi)
        tmp2(cloop) = sum(tmp) / Bprime;
    end; % for cloop = 1 : length(Fi),
    Bprime = Bprime + 1;
    Nf = Nf - 1;

    % Find the largest f
    [tmp3, tmpindex] = sort(tmp2);
    Bi(Bprime) = Fi( tmpindex( 1 ) );
    Fi = nixcol( Fi, tmpindex( 1 ) );

    % Get another background speaker from Fi until Bprime = B
end; % while Bprime < B,

%
% fndindx.m
%
% [indices] = fndindx(shortlist, fulllist)
%
% Find the indices of a subset from a complete list
%
function [indices] = fndindx(shortlist, fulllist)
indices = [];
for loop = 1 : size(shortlist,1),
    for loop2 = 1 : size(fulllist,1),
        if fulllist(loop2,:) == shortlist(loop,:),
            indices(loop) = loop2;
        end;
    end; % loop2 = 1 : length(fulllist),
end; % for loop = 1 : length(shortlist),

%
% loadphndb.m
%
% function phn=loadphndb;
%
% Description: This function reads in a tabular "data base" of all phoneme
%              labels and their classification (voiced/unvoiced) allowed in
%              the TIMIT phoneme files.
%
% Author: Capt Al Harb, USAF
% Date: 29 Jul 96
% Modified:
%
% Input parameters:
%   none
%
% Output Parameters:
%   phn: A matrix containing each possible phoneme label (columns 1-4) and
%        its classification ("VOICED" or "UNVOICED") (columns 5-13).
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%
function phn=loadphndb;
%
% Save current directory location so we can return here.
%
chgdir=pwd;
chgdir=['cd ' chgdir];
%
% Go to location of "data base" file.
%
%
% Open file
%
fid=fopen('/home/hawkeye5/96d/harb/matlab/thesis_code/timit_phoneme.txt','r');
%
% Read in data base as a long string of characters
%
p=setstr(fread(fid,'char'));
%
% Since each row/phoneme is 23 characters wide and there are 62 possible
% labels, reshape into a 62x23 matrix.
%
p=reshape(p,23,62)';
%
% Save the label (columns 1-4) and V/UV label (15-23).
%
phn=[p(:,1:4) p(:,15:23)];
%
% Close the file
%
fclose(fid);

%
% maincohort.m
%
% Read in a list of speakers' scores for a given speaker's utterance
%
% Set up
close all
clear all

N = 20;   % Size of pool to use in selecting background speakers
Bt = 10;  % Total number of background speakers to use as close
          % and far cohorts. Should be an even number.
corpus = ['timitmw'];
B = floor(Bt/2); % Number of far or close cohorts to pick

% load a list of speakers
% For actual
% speaklist = ['/home/fuggles1/rreid/speakerlist/', corpus, '/allspkr.lis'];
speaklist = ['/home/fuggles1/rreid/speakerlist/timit/testspeaker.lis'];
speakers = readafil(speaklist);
numspeakers = length(speakers);

% cohortsdir = ['/home/fuggles1/rreid/toy/Cohorts'];
cohortsdir = ['/home/fuggles1/rreid/Cohorts/', corpus];

% For each speaker in the list
starttime = cputime;
for loop = 1 : numspeakers,
    % Read in the list of scores for speakers' models for speaker(loop)'s sa1
    oscorefile = [cohortsdir '/' speakers(loop,:) '/sa1scores' ];
    oscores = readfil2(oscorefile);

    % Read in the list of scores for speakers' models for speaker(loop)'s sa2
    sscorefile = [cohortsdir '/' speakers(loop,:) '/sa2scores' ];
    spkrlscores = readfil2(sscorefile);

    % Put scores into the proper matrix
    sa1matrix( loop, : ) = oscores';
    sa2matrix( loop, : ) = spkrlscores';
end; % for loop = 1 : numspeakers,
stoptime = cputime;
samatrixtime = stoptime - starttime;

% Save the sa1 and sa2 matrices for future use (just in case)
eval(['save ' cohortsdir '/samatrix sa1matrix sa2matrix ', ...
      'samatrixtime' ])
disp(['Determined SA matrices'])

% Determine distortion metric scores
starttime = cputime;
[distortionmatrix] = distmtrx(sa1matrix, sa2matrix);
stoptime = cputime;

% Save the distortion scores
distortiontime = stoptime - starttime;
eval(['save ' cohortsdir '/dismtrx distortionmatrix distortiontime' ]);

% Find the cohorts
starttime = cputime;
for loop = 1 : numspeakers,
    ccohorts = fndcchrt(distortionmatrix, N, B, loop);
    fcohorts = fndfchrt(distortionmatrix, N, B, loop);
    cohorts = [ccohorts, fcohorts];
    ncohorts(loop,:) = cohorts;

    % Save the cohorts for future use
    cohorts = cohorts';
end; % for loop = 1 : numspeakers,
stoptime = cputime;
cohorttime = stoptime - starttime;

% Save the entire corpus' cohorts as a single matrix
eval(['save ' cohortsdir '/cohorts ncohorts'])
eval(['save ' cohortsdir '/cohorttime cohorttime'])

%
% modwmfcc.m
%
% This m-file creates MFCCs by first warping the utterance according to a
% given speaker's average frequency spectrum (the "Modified Wiener" approach).
%
clear all

eaddr = ['rreid@hawkeye.afit.af.mil'];
corpus = ['timit'];
version = ['mw'];

% Which type original db, or parameter
typefile = ['orig'];

emsg = [' MFCCs made '];
% emsg = [' labelled segments made from ' corpus ' '];

uorv = ['u'];
uorv = ['v'];

% !HCopy -C configuration-file input-file output-file
configfile = ['/home/fuggles1/rreid/htkscripts/timithconfig'];
hcopy1 = ['!HCopy -C ' configfile ];

basedir = ['/home/fuggles2'];
savedirv = ['/home/fuggles1/rreid/mfcc/', corpus, version ];
msrcdir = ['/home/fuggles1/rreid/uttlists/', corpus, '/', typefile];
modeldir = ['/home/fuggles1/rreid/FFTmodels'];

regions = [1, 2, 3, 4, 5, 6, 7, 8];
regions = [ 2];

% Set flags
%
% Set flag for whether to make the unvoiced portions as well as the voiced ones.
%
unvoicedflag = 0;
voicedflag = 1;
testonly = ['y'];

samplefrequency = 16000;
windowsize = 0.020;
fftsize = 512;
sf = samplefrequency; % Sample frequency
wlength = windowsize; % window length
fstep = [0.010];      % frame step size
param = 0;            % Setting based on HTK formats 0 = wav
phnfile = [];         % dir with name of file (*.phn) with phoneme labels

% Note: ['00000'] does not work properly for speakerold
speakerold = ['zzzzz'];

% Set trotst to 1 when pulling raw data from the test regions
trotst = 1;

while trotst <= 2,
  if trotst == 2
    setdir = ['train'];
  else
    setdir = ['test'];
  end
  regloop = 1;
  for regloop = 1 : length(regions),
    tmpdir = ['/home/fuggles1/rreid/tmpr', num2str( regions(regloop) ) ];
    regdir = ['dr', num2str( regions(regloop) ) ];
    usedir = [basedir, '/', corpus, '/', setdir, '/', regdir ];
    getfile = [msrcdir, '/', regdir, setdir, 'list.txt'];

    % Open the list of utterances
    [fid,message]=fopen(getfile,'r');

    % Read in the lines of getfile until reaching end of file
    done=0;
    while ~done
      % P is the utterance to use
      P=fgetl(fid);
      % disp('accomplished fgetl')
      if ~isstr(P)
        % Not a string so set done to quit
        done=1;
      else
        % P is a valid string so continue
        % Create a string to use for storage
        p2 = fliplr(P); % Reverse the sequence

        % Get utterance and extension only and put in normal order
        utterance = fliplr( p2( 1 : find(p2=='/') - 1 ) );

        % Get just the utterance name without the extension
        utterance2 = utterance( 1 : find( utterance == '.' ) - 1);
        region = regdir;

        % Next line assumes data is in NIST format of
        % corpus/section/region/speaker/utterance
        % and speaker names are five (5) characters long
        speaker = P( length(P)-(length(utterance) + 5) : ...
                     length(P)-(length(utterance) + 1) );
        usefile = [ tmpdir, utterance2, '.swa' ];

        % To pull from the actual data
        usefile2 = [ usedir, speaker, utterance ];

        % To pull from locations other than actual data
        % usefile3 = [ usedir, '/', speaker, '/', corpus, '/', ...
        %              utterance ];
        temp = [tmpdir, '/', utterance2 ];

        % Remove the header information in order to process in MatLab
        % Remove the NIST header information and byte swap the file
        eval(['!bhd ' usefile2 ' | dd conv=swab of=' temp '.swa' ])

        % Read in the file for use
        temp2 = [temp '.swa'];
        data = read_dat(temp2,'short');
        numsamples = max(size(data));

        % Load in the FFT model
        if speaker ~= speakerold
          % Loads speaker's model as avgfft
          eval(['load ' modeldir '/' speaker]);
          speakerold = speaker;
          disp(speaker)
        end;

        % Warp the utterance according to the appropriate
        % speaker's spectra

        % Calculate the average DFT of the utterance
        [uttavgfft] = fftavgf(data, fftsize, samplefrequency, ...
                              windowsize);

        % Compute the Modified Wiener impulse response
        immw = avgfft ./ uttavgfft;
        htmmw = fftshift( real( ifft( immw, fftsize) ) );

        % Convolve the original utterance with the Modified Wiener
        % impulse response (which hopefully diminishes the channel
        % effects)
        newdatammw = conv(data, htmmw);

        % Extract the voiced and unvoiced portions
        phnfile = [ usedir, speaker, utterance2, '.phn' ];
        [voiced, unvoiced] = exvu4htk(newdatammw,usefile,sf,wlength,fstep,phnfile);
        % Write the voiced part to a file
        if (voicedflag == 1),
          temp4 = [tmpdir, utterance2, '.htk'];
          w_error = whtkwav(voiced, temp4, sf, param);
          if w_error ~= 0,
            disp(['error writing ', temp4])
          % else
          %   disp(['OK'])
          end;
        end; % if (voicedflag == 1),

        % Write the unvoiced part to a file
        if (unvoicedflag == 1),
          temp4 = [savediru, '/', speaker, '/', utterance2, '.htk'];
          w_error = whtkwav(unvoiced, temp4, sf, param)
        end; % if (unvoicedflag == 1),

        % Calculate the MFCCs
        hcopy = [hcopy1, ' ', temp4, ' ', savedirv, '/', speaker, '/', ...
                 utterance2, '.mfc'];
        eval(hcopy)

        % Clean up temporary files
        eval(['!rm ', tmpdir, '/', utterance2, ])
      end; % end if ~isstr(P)
    end; % end while ~done

    emsg2 = ['dr ', num2str(regions(regloop)), ' ', emsg, ' for ', ...
             setdir];
    emailmsg(eaddr,emsg2);
    fclose(fid);
  end; % end for regloop

  trotst = trotst + 1;
  if testonly == ['y']
    trotst = 3;
  end;
end; % while trotst <= 2,
disp(['Done'])

%
% [outvect, count] = read_dat(infile, datatype);
%
% Input:
%   infile   - filename of file to read in
%   datatype - format of data, e.g., char or long
%
% Output:
%   outvect  - vector of values from infile
%   count    - number of elements in outvect
%
function [outvect, count] = read_dat(infile, datatype);
[fid,message] = fopen(infile,'r');
[outvect,count] = fread(fid,datatype);
fclose(fid);

% 

%  readafil.m 

% 

%  [list]  =  readafil(infile) 

% 

%  Read  in  a  text  file  and  convert  its  contents  into 
%  a  MATLAB  variable 
% 

function  [list]  =  readafil(infile) 

[fid]  =  fopen(infile,  'r'); 

done  =  0; 
list  =  []; 
count  =  0; 

while ~done
    temp = fgetl(fid);
    if ( isstr(temp) ),
        count = count + 1;
        % list = [list; temp];
        list(count, : ) = [temp];
else 

done  =  1; 
end; 
end; 

fclose(fid); 

% 

%  readfil2.m 

% 

%  [list]  =  readfil2(infile) 

% 

%  Read  in  2nd  column  of  a  text  file  and  convert 
%  to  a  MATLAB  variable.  Assumes  7  characters 

%  before  the  first  character  of  second  column 

% 

function  [list]  =  readfil2 (infile) 

[fid]  =  fopen(infile,  'r'); 

done  =  0; 
list  =  []; 
count  =  0; 


while ~done
    temp = fgetl(fid);
    if ( isstr(temp) ),
        count = count + 1;
        % list = [list; temp];
        list(count, : ) = [str2num(temp( : , 7:length(temp) ) ) ];
    else
        done = 1;
    end;
end;

fclose(fid);


% 

%  readhtkn.m 

% 

% [data, numsamples] = readhtkn(filename, htkformat);

% 

%  Read  HTK  2.0  files  into  MatLab. 

% 

%  Potential  Bug: 

%  This  was  designed  to  read  HTK  waveform  files. 

%  For  other  types  of  files,  some  adjustments  of  the 

%  fread  size  may  need  to  be  made. 

% 

%  Created  by  Capt  R.  Brian  Reid 
%  19  Aug  1997 

% 

%  Modified  16  Sep  1997  to  return  the  number  of  samples 
function  [data,  numsamples]  =  readhtkn(filename,htkformat) 

[fid,  errmsg]  =  fopen(filename,  'r'); 

%  errmsg 

%  Read  in  the  head  information 

numsamples = fread(fid,1,'int32')
sampleperiod = fread(fid,1,'int32')
samplesize = fread(fid,1,'int16')
paramkind = fread(fid,1,'int16')

if htkformat == 0,
    datatype = ['int16'];
else
    datatype = ['float32'];
end;

% disp(['datatype is ' datatype ])

%  Read  in  the  actual  data 

data  =  fread(fid,  numsamples,  datatype); 

%  Close  the  file 
fclose(fid); 

% readphn.m
%
% function [a,b]=readphn(phnfile);
%
% Description: This function reads in the hand labelled TIMIT phoneme file and
%              returns a matrix of phoneme labels and a matrix of phoneme
%              start points and end points.
%
% Author: Capt Al Harb, USAF
% Date: 29 Jul 96
% Modified:
%
% Input parameters:
%   phnfile: The name of the TIMIT phoneme file.
%
% Output Parameters:
%   a: A matrix containing the starting point and ending points of each
%      phoneme in columns 1 and 2 respectively. Each row is a different
%      phoneme.
%
%   b: A matrix of phoneme labels. Each row is a different phoneme.
%
% Subroutines directly called:
%   none
%
% Subroutines indirectly called:
%   none
%

function [a,b]=readphn(phnfile);

% Start with empty outputs so rows can be appended below
a=[];
b=[];
%
% Open TIMIT phoneme file for reading.
%
[fidphn, phnmsg] = fopen(phnfile,'r');
% phnmsg
%
% Continue to read until reaching the end-of-file.
%
while ~feof(fidphn)
    %
    % Get one line of the file as a string.
    %
    s=fgetl(fidphn);
    %
    % Check to see if it's the end of file, if not, continue to process.
    %
    if s~=(-1)
        %
        % Break up string into 2 integers and a string.
        %
        p=sscanf(s,'%i %i %s');
        %
        % The two integers are the start point and end point of the phoneme.
        %
        a=[a;p(1) p(2)];
        %
        % Set the numerical version of the label string to an actual string.
        %
        p=setstr(p(3:length(p)))';
        %
        % Prepend the string with spaces to bring length of string to 4 with the
        % label right justified.
        %
        if length(p)==2
            p=['  ' p];
        elseif length(p)==3
            p=[' ' p];
        elseif length(p)==1
            p=['   ' p];
        end;
        %
        % add new label to b matrix
        %
        b=[b;p];
    end;
end;
fclose(fidphn);
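
As a cross-check on the parsing above, the same idea can be written as a short Python sketch. It is an illustration, not the thesis code: the function name parse_phn is hypothetical, and it assumes each line of a TIMIT .phn file holds a start sample, an end sample, and a label separated by whitespace. Labels are right-justified to four characters, as in readphn.

```python
def parse_phn(text):
    """Parse the text of a TIMIT .phn file.

    Returns (bounds, labels): bounds is a list of (start, end) sample
    indices and labels is a list of 4-character, right-justified strings.
    """
    bounds, labels = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        bounds.append((int(start), int(end)))
        labels.append(label.rjust(4))
    return bounds, labels
```

The fixed label width matters because loadphndb stores its labels in the first four columns of each row, so the two can be compared character by character.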

% 

%  spkrid.m 

% 

%  This  file  pulls  in  utterance  scores  for  all  speakers  and  determines  the 
%  identity  of  the  speaker  based  on  highest  score  by  GMM  model. 

% 

clear  all 

corpus  =  ['timit']; 
version  =  ['cmsbl']; 

saveflag  =  ['y']; 

tgtfile = ['/home/fuggles1/rreid/Results/' corpus version '/idsanew'];
idfile = ['/home/fuggles1/rreid/Results/' corpus version '/idsa'];

cohorts = [];

% Get the needed info
% for actual
speaklist = ['/home/fuggles1/rreid/speakerlist/', corpus, '/alltsttr.lis'];
eval(['load /home/fuggles1/rreid/Cohorts/timit' version '/cohorts']);

speakers = readafil(speaklist);
numspeakers = length(speakers);
ncohorts = cohorts;

% Zero out the confusion matrix
cmatrix = zeros(numspeakers);

% Read in a list of files to check
% Get the scores and put into the proper form
file2read = ['/home/fuggles1/rreid/Results/' corpus version '/sascores.txt'];
fid = fopen(file2read,'r');

done  =  0; 
correct  =  0; 
errors  =  0; 
count  =  1; 
spkrnum  =  1; 
filenum  =  1; 

while ~done
    nowfile = fgetl(fid);
    if isstr(nowfile)
        % Load in scores for a given utterance
        % Use for MatLab files
        % eval(['load ' nowfile ]);
        % scores = sxscores;

        % Use for ASCII files
        scores = readfil2(nowfile)';
        lscores = max(size(scores));
        scores = scores( lscores - numspeakers + 1 : lscores);

        % Determine the winner
        [tmp, tmpindex] = sort( scores );
        winner = tmpindex(length(tmpindex) );
        temp = fliplr( nowfile );
        temp1 = find( temp == '/' );
        realspkr = fliplr( temp(temp1(1) + 1 : temp1(2) - 1 ) );
        realspkrnum = fndindx(realspkr, speakers);

        % Update the confusion matrix based on which speaker made the utterance
        % and which speaker had the highest score for the utterance
        cmatrix(realspkrnum, winner) = cmatrix(realspkrnum, winner) + 1;
        if realspkrnum == winner
            correct = correct + 1;
        else
            errors = errors + 1;
        end;

        % disp(['Finished file ' num2str(filenum) ])
        filenum = filenum + 1;
    else
        done = 1;
    end; % if isstr(nowfile);

    % Get the next utterance
end; % while ~done
fclose(fid)

correct
errors
total = correct + errors
percorrect = correct/total*100
pererror = errors/total*100
correct2 = sum(diag(cmatrix))

% save the identification results
if saveflag == ['y']
    eval(['save ' idfile ])
end;

%
% verifhtk2.m
%
% [score] = verifhtk2(speaker, cohorts, speakerlist, uttscores)
%
% Function to perform verification for a given utterance using utterance
% scores already calculated by HTK 2.0
%
% Inputs:
%   speaker      number of speaker as determined by speakerlist
%   cohorts      vector of cohorts containing integer values of cohorts in
%                speakerlist
%   speakerlist  list of all speakers
%   uttscores    scores for all models for a given utterance
%
% Output:
%   score        normalized log probability score of speaker given cohorts
%
function [score] = verifhtk2(speaker, cohorts, speakerlist, uttscores)

% Calculate the score for speaker
oscores(1) = uttscores(speaker);

% Calculate the scores for the cohorts and append them to the vector
% containing the speaker's score
for cloop = 1 : length(cohorts),
    oscores(cloop + 1) = uttscores(cohorts(cloop) );
end;

% Determine the actual score based on score = speakerscore -
% sum(cohortscores) / # cohorts
score = oscores(1) - sum(oscores( 2 : length(oscores) ) ) / length(cohorts);
% end of verify
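
The normalization in verifhtk2 reduces to one line of arithmetic: the claimed speaker's log score minus the mean log score of the cohort models. A minimal Python sketch, with a hypothetical function name and 0-based speaker indices, is:

```python
def cohort_normalized_score(speaker, cohorts, utt_scores):
    """Claimed speaker's log score minus the mean cohort log score.

    speaker    -- index of the claimed speaker in utt_scores
    cohorts    -- indices of that speaker's cohort (background) models
    utt_scores -- log-likelihood score of the utterance under every model
    """
    claimed = utt_scores[speaker]
    cohort_mean = sum(utt_scores[c] for c in cohorts) / len(cohorts)
    return claimed - cohort_mean
```

For instance, with scores of -50, -60 and -70 for the claimed speaker and two cohorts, the normalized score is -50 - (-65) = 15. A verification decision would then compare this value against a threshold, which this sketch does not choose.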
%
% whtkwav.m
%
% function w_error = whtkwav(data, filename, samplefreq, param)
%
% Writes waveform data to 'filename' in HTK standard binary format.
% The data is written with the appropriate 12 byte header together with
% the data in the proper byte format.
%
% Ensure the data is passed as a matrix with each row corresponding
% to a frame, and each column contains the parameter (fft spectra, etc)
%
% Originally from Al Harb's:
%   function w_error = write_HTK_param(data,filename)
%
% Modifications allow:
%   Varying sample periods
%   writing back into the desired parameter format
%   Use 0 for waveform 6 for HTK MFCC and 9 for user defined
%   Reference HTK V2.0 manual page 73
%
% Modifications by R. Brian Reid
% Modified: 20 Aug 1997 1350

% 

function w_error = whtkwav(data, filename, samplefreq, param)

fid = fopen(filename,'w');
if (fid == -1);
    error('Unable to open the file to write HTK parameter data');
end

% Check the number of input arguments
if nargin < 3,
    % Use the defaults
    samplefreq = 16000;
    param = 0;
end; % if nargin < 3

% Compute sample period from sample frequency in terms of 100ns
sampleperiod = ceil( 1 / samplefreq / ( 100 * 10^(-9) ) );

% Determine the number of bytes per sample
if param == 0,
    bytespersamp = 2;
    datasize = ['int16'];
    numsamples = length(data);
else
    bytespersamp = 4*size(data,2);
    datasize = ['float32'];
    numsamples = size(data,2);
end;

% Write the header

% Write the number of samples in the file (4-byte integer)
fwrite(fid,numsamples,'int32');

% Write the sample period in 100ns units (4-byte integer)
fwrite(fid,sampleperiod,'int32');

% Write the number of bytes per sample
% need a 4-byte float for each parameter in the feature vector
fwrite(fid,bytespersamp,'int16');

% Write code indicating the sample kind (2-byte integer)
% Use 0 for HTK waveform, 6 for HTK MFCC, and 9 for user defined
fwrite(fid,param,'int16'); % User defined parameters data flag

% Write the data into file in datasize chunks
fwrite(fid,data,datasize);

fclose(fid);
w_error = 0;
return;
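
The 12-byte header written by whtkwav can be shown with Python's struct module. This sketch covers only the waveform case (parameter kind 0, 16-bit samples); the function name htk_waveform_bytes is hypothetical, and big-endian byte order is an assumption of the sketch, since the MATLAB listing writes in the byte order of the machine it runs on. The sample period is rounded here, whereas the listing uses ceil.

```python
import struct


def htk_waveform_bytes(samples, sample_rate=16000):
    """Return an HTK waveform file image: 12-byte header plus int16 samples."""
    period = round(1e7 / sample_rate)  # sample period in 100 ns units
    header = struct.pack(
        ">iihh",
        len(samples),  # number of samples (4-byte integer)
        period,        # sample period (4-byte integer)
        2,             # bytes per sample (2-byte integer)
        0,             # parameter kind, 0 = waveform (2-byte integer)
    )
    body = struct.pack(">%dh" % len(samples), *samples)
    return header + body
```

At 16 kHz the sample period field is 625, that is, 62.5 microseconds expressed in units of 100 ns.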
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B.2  C-Shell  Scripts  ^ 


tt//bin/csh 

#  gmm2maker 

#  UNIX  C-shell  script  for  creating  Gaussian  Mixture  Models  (GMM)  using  HTK  2.0 
it 

it  Currently  set  for  full  TIMIT 
Hit 

it  Output  Files  of  HMMs  for  each  speaker  in  the  speaker  list 
it  Assumptions:  Using  voiced  speech  (only  in  terms  of  file  locations) 

a  10 

set  corpus  =  timit 
set  version  =  cmsbl 

set  eaddrl  =  "rreidQhawkeye.afit.af .mil” 
set  eaddr2  =  "rreid" 

set  emsg  =  "GMMs  have  been  created  for  $corpus  region  $1” 


^  Lines  beginning  with  -  S  should  be  made  into  a  continuation  of  the  proceeding  line. 
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set  hconfig  =  /home/fugglesl/rreid/htkscripts/hconfig$version 

set  mlf  =  /home/fugglesl/rreid/LABEL/tmEister.mlf 

set  spkrlab  =  /home/fugglesl/rreid/toy 

set  tmpdir  =  "/home/hawkeyel2/97d/rreid/SID/SID'' 

cd  Stmpdir 

set  spkrlist  =  /home/fugglesl/rreid/speakerlist/$corpus/trainspeaker{$l}.lis 

set  hmmdir  =  "/home/fugglesl/rreid/tmpr"{$l} 

#  Source  directory  for  liata  of  utterancea  for  each  apeaker 

set  srcdir  =  /home/fugglesl/rreid/uttlists/{$corpus}{$version} 

set  mfcctgtdir  =  /home/fugglesl/rreid/hmm 

set  mfcctgtdirl  =  $mfcctgtdir/$corpus$version 

if  (!  — d  Smfcctgtdirl  )  then 
mkdir  Smfcctgtdirl 
chmod  775  Smfcctgtdirl 
endif 

foreach  speaker  (‘cat  Sspkr^ist') 

set  mfcctgt  =  Smfcctgtdirl"/"Sspeaker 
set  trainfile  =  "$srcdir/$speaker/sasisx3.tra" 
echo  Sspeaker  >  Shmmdir/hmmlist 
echo  Sspeaker 

echo  "MU  2  {Sspeaker. state [2]  .mix}"  >  Shmmdir/edcdlisl 

if  (!  — d  Shmmdir/hmm.O  )  then 
mkdir  Shmmdir/hmm.O 
chmod  774  Shmmdir/hmm.O 
endif 

if  (  — d  Shmmdir/hmm.l  )  then 
rm  — fr  Shmmdir/hmm.l 
endif 

mkdir  Shmmdir/hmm.l 
if  (  !  — d  tmp  )  mkdir  tmp 

if  (  — e  Shmmdir/hmm.l/Sspeaker  )  then 
rm  — f  Shmmdir/hmm.l/Sspeaker 
endif 

if  (  — e  Shmmdir/hmm.O/Sspeaker  )  then 
rm  — f  Shmmdir/hmm.O/Sspeaker 
endif 

ttHInitHInit  -G  Shconfig  -i  15  -L  Sapkrlab  -v  0.01  -o  Sapeaker  -M  $hmmdir/hmm.O 
it  -S  StrainfUe  protogmmS 

HInit  — C  Shconfig  — i  15  — L  Sspkrfab  — v  0.01  — o  Sspeaker  — M  Shmmdir/hmm.O 

— S  Strainfi^e  protogmmcms 

HRest  — C  Shconfig  — i  15  — L  Sspkr^ab  — v  0.01  — M  Shmmdir/hmm.l 
— S  Strainfi^e  Shmmdir/hmm.O/Sspeaker 

cp  Shmmdir/hmm.l/Sspeaker  Shmmdir/hmm.O/Sspeaker 
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HHEd  -C  Shconfig  — d  $hmmdir/hmm.O  — M  $hmmdir/hmm.l  Shmmdir/edcdlisl  Shmmdir/hmmlist 
cp  $hmmdir/hmm.l/$speaker  $hmmdir/hmm.O/$speaker 
®  maxNum  =  32 

®  numiter  =  (${maxNum}  —  2  )  /  2 

®  loop  =  1  100 

®  index  =  2 


while  (  ${loop}  <=  {$nu miter}  ) 

®  index  =  Sindex  +  2 

set  mixturecmd="MU  $index  " 

echo  Smixturecmd  "{$speaker.state[2]  .mix}”  >  $hmmdir/edcmd 
echo  Smixturecmd 

HHEd  — C  Shconfig  — d  Shmmdir/hmm.O  — M  Shmmdir/hmm.l  Shmmdir/edcmd  Shmmdir/hmmlist 

cp  Shmmdir/hmm.l/Sspeaker  Shmmdir/hmm.O/Sspeaker 

HRest  — C  Shconfig  — i  15  — L  Sspkr^ab  — v  0.01  — M  Shmmdir/hmm.l 
— S  Strainfi^e  Shmmdir/hmm.O/Sspeaker 

cp  Shmmdir/hmm.l/Sspeaker  Shmmdir/hmm.O/Sspeaker 

®  loop  d-d- 

end 

cp  Shmmdir/hmm.l/Sspeaker  S{mfcctgt} 
chmod  774  S{mfcctgt} 

#  loop  to  get  the  next  speaker 

rm $hmmdir/hmm.0/$speaker
rm $hmmdir/hmm.1/$speaker

end 

#  Release  the  HTK  license 
Hfree 

#  Notify  user  that  GMMs  have  been  created 
cd ~rreid/matlab/thesis/tools

mailer.c $eaddrl $emsg




#!/bin/csh
#
# uttscores2.c
#
# UNIX C-shell script to determine scores for all models
# for a given utterance using the Viterbi algorithm
# in HVite.
#
# Input: $1 is the region to determine scores for
#
# Variables to set:
#
# srcdir     Path for Speaker list, sets Results,
# srchmm     Path for location of hmms
# tgtdir     Path to place
# scriptdir  name of directory containing directories of sa, si, sx, all.tra lists
# uttdir     Path for actual parameterized utterances

set eaddr = rreid@hawkeye.afit.af.mil

set uttfile = sa.tra
set region = {$1}
set corpus = timit
set version = cmsbl

set emsg = "Probability scores calculated for $corpus$version. "
set mastermlf = /home/fugglesl/rreid/LABEL/{$corpus}master.mlf
set uttlistdir = /home/fugglesl/rreid/uttlists/{$corpus}{$version}
set hconfig = /home/fugglesl/rreid/htkscripts/hconfig{$version}

# For final problem
set srcdir = /home/fugglesl/rreid
set srchmm = /home/fugglesl/rreid/hmm/timit$version
set srcdir2 = /home/fugglesl/rreid/uttlists/{$corpus}{$version}
set hmmdir = $srchmm
set tgtdir = /home/fugglesl/rreid/Cohorts/{$corpus}{$version}
set uttdir = /home/fugglesl/rreid/mfcc/{$corpus}{$version}
set speakerlist = /home/fugglesl/rreid/speakerlist/timit/speaker{$region}s.lis
set allspeakerlist = /home/fugglesl/rreid/speakerlist/$corpus/trainspeaker.lis

# Needed for HTK V2.0 (speakerdic and spknet can be the same for all corpora)
set speakerdic = /home/fugglesl/rreid/Networks/timit/timit.dic
set spknet = /home/fugglesl/rreid/Networks/timit

# Check to ensure the appropriate directory exists
if (! -d {$tgtdir} ) then
mkdir {$tgtdir}
chmod 774 {$tgtdir}
endif

foreach spk (`cat $speakerlist`)

# Ensure speakers directory exists
if (! -d {$tgtdir}/{$spk} ) then
mkdir {$tgtdir}/{$spk}
chmod 774 {$tgtdir}/{$spk}
endif

set Resultsdir = $tgtdir/$spk

# Save speaker2 and score for speaker 1's utterance
foreach spk2 (`cat $allspeakerlist`)

# For each speaker, spk2, get a score for each utterance

# File name containing current hmm
printf "%s\n" $spk2 > $hmmdir/hmmlist{$region}

foreach utterance (`cat $uttlistdir/$spk/$uttfile`)

set utter = {$utterance}

set utter2=`basename $utter`
set utter2=$utter2:r

set tgtfile = {$utter2}scores

set results = {$tgtdir}/{$spk}/{$tgtfile}
printf "%s\t" $spk2 >> {$results}

HVite -C $hconfig -a -d $srchmm -I $mastermlf -y svd -l $Resultsdir -o N \
-w {$spknet}/{$spk2}.net $speakerdic {$hmmdir}/hmmlist{$region} $utter

awk '{printf ("%s\n", $4)}' {$Resultsdir}/{$utter2}.svd >> {$results}

rm {$Resultsdir}/{$utter2}.svd

#  Get  next  utterance 

end 

#  Get  next  speaker2 

end 

# Get next speaker1
end 

set emsg2 = "Region $region $emsg"

cd ~rreid/matlab/thesis/tools
mailer.c $eaddr $emsg2 $results